Project description

This project explores the prediction of loan default status using a machine learning classification approach. The notebook includes data preprocessing, exploratory data analysis, feature engineering, and the application of classification algorithms such as Logistic Regression and Random Forest. The goal is to accurately classify whether a loan applicant is likely to default, leveraging real-world financial data and evaluating model performance through metrics like accuracy and confusion matrix. The project provides insights into important predictive features and demonstrates a practical workflow for tackling binary classification problems in the financial domain.

Table of contents

  1. Data loading
  2. Preliminary Data Analysis
  3. Explanatory Data Analysis
  4. Modeling
  5. Dealing with Imbalanced Data
  6. Conclusions

Data loading

knitr::opts_chunk$set(warning = FALSE, message = FALSE, cache=TRUE)
install.packages(c("dplyr", "tidyr", "stringr", "formattable", "ggcorrplot", "ggplot2", "lubridate", "gridExtra", "readr", "scales", "fastDummies", "plotly", "reshape2", "randomForest", "caret", "RANN", "e1071", "MLmetrics", "xgboost", "DMwR2", "smotefamily", "glmnet", "rpart", "caTools", "progress", "stats"))
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
options(repos = c(CRAN = "https://cloud.r-project.org"))
library(dplyr)
library(tidyr)
library(stringr)
library(formattable)
library(ggcorrplot)
suppressWarnings({
  library(ggplot2)
  library(lubridate)
  library(gridExtra)
})
library(readr)
library(scales)
library(fastDummies)
library(plotly)
library(reshape2)
library(randomForest)
library(caret)
library(RANN)
library(e1071)
library(MLmetrics)
library(xgboost)
library(DMwR2) 
library(smotefamily)
library(glmnet)
library(rpart)
library(caTools)
library(progress)
library(stats)  
train <- read_csv("train.csv")
test  <- read.csv("test.csv")
df    <- bind_rows(train, test)

Preliminary Data Analysis

cat("Shape of training dataframe: ", dim(train), "\n")
## Shape of training dataframe:  233154 41
cat("Shape of testing dataframe: ", dim(test), "\n")
## Shape of testing dataframe:  112392 40
train <- train[!duplicated(train), ]
test <- test[!duplicated(test), ]

cat("Shape of training dataframe after removing duplicates: ", dim(train), "\n")
## Shape of training dataframe after removing duplicates:  233154 41
cat("Shape of testing dataframe after removing duplicates: ", dim(test), "\n")
## Shape of testing dataframe after removing duplicates:  112392 40
cat("Names of columns: ", colnames(train), "\n")
## Names of columns:  UNIQUEID DISBURSED_AMOUNT ASSET_COST LTV BRANCH_ID SUPPLIER_ID MANUFACTURER_ID CURRENT_PINCODE_ID DATE_OF_BIRTH EMPLOYMENT_TYPE DISBURSAL_DATE STATE_ID EMPLOYEE_CODE_ID MOBILENO_AVL_FLAG AADHAR_FLAG PAN_FLAG VOTERID_FLAG DRIVING_FLAG PASSPORT_FLAG PERFORM_CNS_SCORE PERFORM_CNS_SCORE_DESCRIPTION PRI_NO_OF_ACCTS PRI_ACTIVE_ACCTS PRI_OVERDUE_ACCTS PRI_CURRENT_BALANCE PRI_SANCTIONED_AMOUNT PRI_DISBURSED_AMOUNT SEC_NO_OF_ACCTS SEC_ACTIVE_ACCTS SEC_OVERDUE_ACCTS SEC_CURRENT_BALANCE SEC_SANCTIONED_AMOUNT SEC_DISBURSED_AMOUNT PRIMARY_INSTAL_AMT SEC_INSTAL_AMT NEW_ACCTS_IN_LAST_SIX_MONTHS DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS AVERAGE_ACCT_AGE CREDIT_HISTORY_LENGTH NO_OF_INQUIRIES LOAN_DEFAULT

We have 41 variables, no duplicates were detected. Overall, we have over 230 thousand observations in our training set. Let’s check whether the dataset has any missing values.

total <- nrow(train)
missing_data <- train %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "column name", values_to = "Total missing") %>%
  mutate(`Percent missing` = (`Total missing` / total) * 100) %>%
  arrange(desc(`Total missing`))
print(missing_data)
## # A tibble: 41 × 3
##    `column name`      `Total missing` `Percent missing`
##    <chr>                        <int>             <dbl>
##  1 EMPLOYMENT_TYPE               7661              3.29
##  2 UNIQUEID                         0              0   
##  3 DISBURSED_AMOUNT                 0              0   
##  4 ASSET_COST                       0              0   
##  5 LTV                              0              0   
##  6 BRANCH_ID                        0              0   
##  7 SUPPLIER_ID                      0              0   
##  8 MANUFACTURER_ID                  0              0   
##  9 CURRENT_PINCODE_ID               0              0   
## 10 DATE_OF_BIRTH                    0              0   
## # ℹ 31 more rows

We can see that more that 3% of the ‘EMPLOYMENT_TYPE’ variable’s values is missing. We will treat the NA values as not employed.

print(unique(train$EMPLOYMENT_TYPE))
## [1] "Salaried"      "Self employed" NA
length(unique(train$EMPLOYMENT_TYPE))
## [1] 3

Let’s analyze the structure of the data set. We can see that most of the variables are numeric. We see that ‘AVERAGE_ACCT_AGE’ and ‘CREDIT_HISTORY_LENGHT’ can be changed from characters to numbers.

str(train)
## tibble [233,154 × 41] (S3: tbl_df/tbl/data.frame)
##  $ UNIQUEID                           : num [1:233154] 420825 537409 417566 624493 539055 ...
##  $ DISBURSED_AMOUNT                   : num [1:233154] 50578 47145 53278 57513 52378 ...
##  $ ASSET_COST                         : num [1:233154] 58400 65550 61360 66113 60300 ...
##  $ LTV                                : num [1:233154] 89.5 73.2 89.6 88.5 88.4 ...
##  $ BRANCH_ID                          : num [1:233154] 67 67 67 67 67 67 67 67 67 67 ...
##  $ SUPPLIER_ID                        : num [1:233154] 22807 22807 22807 22807 22807 ...
##  $ MANUFACTURER_ID                    : num [1:233154] 45 45 45 45 45 45 45 45 45 45 ...
##  $ CURRENT_PINCODE_ID                 : num [1:233154] 1441 1502 1497 1501 1495 ...
##  $ DATE_OF_BIRTH                      : chr [1:233154] "01-01-1984" "31-07-1985" "24-08-1985" "30-12-1993" ...
##  $ EMPLOYMENT_TYPE                    : chr [1:233154] "Salaried" "Self employed" "Self employed" "Self employed" ...
##  $ DISBURSAL_DATE                     : chr [1:233154] "03-08-2018" "26-09-2018" "01-08-2018" "26-10-2018" ...
##  $ STATE_ID                           : num [1:233154] 6 6 6 6 6 6 6 6 6 6 ...
##  $ EMPLOYEE_CODE_ID                   : num [1:233154] 1998 1998 1998 1998 1998 ...
##  $ MOBILENO_AVL_FLAG                  : num [1:233154] 1 1 1 1 1 1 1 1 1 1 ...
##  $ AADHAR_FLAG                        : num [1:233154] 1 1 1 1 1 1 1 1 1 0 ...
##  $ PAN_FLAG                           : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VOTERID_FLAG                       : num [1:233154] 0 0 0 0 0 0 0 0 0 1 ...
##  $ DRIVING_FLAG                       : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PASSPORT_FLAG                      : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PERFORM_CNS_SCORE                  : num [1:233154] 0 598 0 305 0 825 0 17 718 818 ...
##  $ PERFORM_CNS_SCORE_DESCRIPTION      : chr [1:233154] "No Bureau History Available" "I-Medium Risk" "No Bureau History Available" "L-Very High Risk" ...
##  $ PRI_NO_OF_ACCTS                    : num [1:233154] 0 1 0 3 0 2 0 1 1 1 ...
##  $ PRI_ACTIVE_ACCTS                   : num [1:233154] 0 1 0 0 0 0 0 1 1 0 ...
##  $ PRI_OVERDUE_ACCTS                  : num [1:233154] 0 1 0 0 0 0 0 0 0 0 ...
##  $ PRI_CURRENT_BALANCE                : num [1:233154] 0 27600 0 0 0 ...
##  $ PRI_SANCTIONED_AMOUNT              : num [1:233154] 0 50200 0 0 0 ...
##  $ PRI_DISBURSED_AMOUNT               : num [1:233154] 0 50200 0 0 0 ...
##  $ SEC_NO_OF_ACCTS                    : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SEC_ACTIVE_ACCTS                   : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SEC_OVERDUE_ACCTS                  : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SEC_CURRENT_BALANCE                : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SEC_SANCTIONED_AMOUNT              : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SEC_DISBURSED_AMOUNT               : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PRIMARY_INSTAL_AMT                 : num [1:233154] 0 1991 0 31 0 ...
##  $ SEC_INSTAL_AMT                     : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ NEW_ACCTS_IN_LAST_SIX_MONTHS       : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS: num [1:233154] 0 1 0 0 0 0 0 0 0 0 ...
##  $ AVERAGE_ACCT_AGE                   : chr [1:233154] "0yrs 0mon" "1yrs 11mon" "0yrs 0mon" "0yrs 8mon" ...
##  $ CREDIT_HISTORY_LENGTH              : chr [1:233154] "0yrs 0mon" "1yrs 11mon" "0yrs 0mon" "1yrs 3mon" ...
##  $ NO_OF_INQUIRIES                    : num [1:233154] 0 0 0 1 1 0 0 0 1 0 ...
##  $ LOAN_DEFAULT                       : num [1:233154] 0 1 0 1 1 0 0 0 0 0 ...
parse_age <- function(x) {
  years <- as.numeric(str_extract(x, "\\d+(?=yrs)"))
  months <- as.numeric(str_extract(x, "\\d+(?=mon)"))
  years[is.na(years)] <- 0
  months[is.na(months)] <- 0
  years + months / 12
}

train <- train %>%
  mutate(AVERAGE_ACCT_AGE = parse_age(AVERAGE_ACCT_AGE),
         CREDIT_HISTORY_LENGTH = parse_age(CREDIT_HISTORY_LENGTH))

test <- test %>%
  mutate(AVERAGE_ACCT_AGE = parse_age(AVERAGE_ACCT_AGE),
         CREDIT_HISTORY_LENGTH = parse_age(CREDIT_HISTORY_LENGTH))

Additionally, we changed the format of variables ‘DATE_OF_BIRTH’ and ‘DISBURSAL_DATE’ form object to date.

train$DATE_OF_BIRTH <- as.Date(train$DATE_OF_BIRTH, format = "%d-%m-%Y")
test$DATE_OF_BIRTH  <- as.Date(test$DATE_OF_BIRTH,  format = "%d-%m-%Y")

train$DISBURSAL_DATE <- as.Date(train$DISBURSAL_DATE, format = "%d-%m-%Y")
test$DISBURSAL_DATE  <- as.Date(test$DISBURSAL_DATE,  format = "%d-%m-%Y")

Finally, we obtained the dataset shaped like this:

str(train)
## tibble [233,154 × 41] (S3: tbl_df/tbl/data.frame)
##  $ UNIQUEID                           : num [1:233154] 420825 537409 417566 624493 539055 ...
##  $ DISBURSED_AMOUNT                   : num [1:233154] 50578 47145 53278 57513 52378 ...
##  $ ASSET_COST                         : num [1:233154] 58400 65550 61360 66113 60300 ...
##  $ LTV                                : num [1:233154] 89.5 73.2 89.6 88.5 88.4 ...
##  $ BRANCH_ID                          : num [1:233154] 67 67 67 67 67 67 67 67 67 67 ...
##  $ SUPPLIER_ID                        : num [1:233154] 22807 22807 22807 22807 22807 ...
##  $ MANUFACTURER_ID                    : num [1:233154] 45 45 45 45 45 45 45 45 45 45 ...
##  $ CURRENT_PINCODE_ID                 : num [1:233154] 1441 1502 1497 1501 1495 ...
##  $ DATE_OF_BIRTH                      : Date[1:233154], format: "1984-01-01" "1985-07-31" ...
##  $ EMPLOYMENT_TYPE                    : chr [1:233154] "Salaried" "Self employed" "Self employed" "Self employed" ...
##  $ DISBURSAL_DATE                     : Date[1:233154], format: "2018-08-03" "2018-09-26" ...
##  $ STATE_ID                           : num [1:233154] 6 6 6 6 6 6 6 6 6 6 ...
##  $ EMPLOYEE_CODE_ID                   : num [1:233154] 1998 1998 1998 1998 1998 ...
##  $ MOBILENO_AVL_FLAG                  : num [1:233154] 1 1 1 1 1 1 1 1 1 1 ...
##  $ AADHAR_FLAG                        : num [1:233154] 1 1 1 1 1 1 1 1 1 0 ...
##  $ PAN_FLAG                           : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VOTERID_FLAG                       : num [1:233154] 0 0 0 0 0 0 0 0 0 1 ...
##  $ DRIVING_FLAG                       : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PASSPORT_FLAG                      : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PERFORM_CNS_SCORE                  : num [1:233154] 0 598 0 305 0 825 0 17 718 818 ...
##  $ PERFORM_CNS_SCORE_DESCRIPTION      : chr [1:233154] "No Bureau History Available" "I-Medium Risk" "No Bureau History Available" "L-Very High Risk" ...
##  $ PRI_NO_OF_ACCTS                    : num [1:233154] 0 1 0 3 0 2 0 1 1 1 ...
##  $ PRI_ACTIVE_ACCTS                   : num [1:233154] 0 1 0 0 0 0 0 1 1 0 ...
##  $ PRI_OVERDUE_ACCTS                  : num [1:233154] 0 1 0 0 0 0 0 0 0 0 ...
##  $ PRI_CURRENT_BALANCE                : num [1:233154] 0 27600 0 0 0 ...
##  $ PRI_SANCTIONED_AMOUNT              : num [1:233154] 0 50200 0 0 0 ...
##  $ PRI_DISBURSED_AMOUNT               : num [1:233154] 0 50200 0 0 0 ...
##  $ SEC_NO_OF_ACCTS                    : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SEC_ACTIVE_ACCTS                   : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SEC_OVERDUE_ACCTS                  : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SEC_CURRENT_BALANCE                : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SEC_SANCTIONED_AMOUNT              : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SEC_DISBURSED_AMOUNT               : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PRIMARY_INSTAL_AMT                 : num [1:233154] 0 1991 0 31 0 ...
##  $ SEC_INSTAL_AMT                     : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ NEW_ACCTS_IN_LAST_SIX_MONTHS       : num [1:233154] 0 0 0 0 0 0 0 0 0 0 ...
##  $ DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS: num [1:233154] 0 1 0 0 0 0 0 0 0 0 ...
##  $ AVERAGE_ACCT_AGE                   : num [1:233154] 0 1.917 0 0.667 0 ...
##  $ CREDIT_HISTORY_LENGTH              : num [1:233154] 0 1.92 0 1.25 0 ...
##  $ NO_OF_INQUIRIES                    : num [1:233154] 0 0 0 1 1 0 0 0 1 0 ...
##  $ LOAN_DEFAULT                       : num [1:233154] 0 1 0 1 1 0 0 0 0 0 ...

Explanatory Data Analysis

Now we can proceed with the EDA. Let’s start with

Class distribution

We will analyze the distribution of the target variable ‘LOAN_DEFAULT’.

class_df <- train %>%
  group_by(LOAN_DEFAULT) %>%
  summarise(UNIQUEID_count = n()) %>%
  arrange(desc(UNIQUEID_count))

formattable(class_df, list( UNIQUEID_count = color_bar("lightgreen") ))
LOAN_DEFAULT UNIQUEID_count
0 182543
1 50611
colors <- c("0" = "deepskyblue", "1" = "deeppink")

ggplot(train, aes(x = factor(LOAN_DEFAULT), fill = factor(LOAN_DEFAULT))) +
  geom_bar() +
  scale_fill_manual(values = colors) +
  labs(title = "Class Distribution", x = "Loan Default", y = "Count") +
  theme_minimal()

count_default_0 <- sum(train$LOAN_DEFAULT == 0, na.rm = TRUE)
count_default_1 <- sum(train$LOAN_DEFAULT == 1, na.rm = TRUE)
total <- count_default_0 + count_default_1

percentage_0 <- (count_default_0 / total) * 100
percentage_1 <- (count_default_1 / total) * 100

cat(sprintf("%% of no defaults       : %.2f%%\n", percentage_0))
## % of no defaults       : 78.29%
cat(sprintf("Number of no defaults  : %d\n", count_default_0))
## Number of no defaults  : 182543
cat(sprintf("%% of defaults          : %.2f%%\n", percentage_1))
## % of defaults          : 21.71%
cat(sprintf("Number of defaults     : %d\n", count_default_1))
## Number of defaults     : 50611

Also, let’s analyze the distribution between other categorical variables, such as ‘EMPLOYMENT_TYPE’, ‘MOBILENO_AVL_FLAG’, ‘AADHAR_FLAG’, ‘PAN_FLAG’, ‘VOTERID_FLAG’, ‘DRIVING_FLAG’ and ‘PASSPORT_FLAG’ ‘EMPLOYMENT_TYPE’

train %>%
  group_by(EMPLOYMENT_TYPE, LOAN_DEFAULT) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(EMPLOYMENT_TYPE) %>%
  mutate(prop = count / sum(count)) %>%
  arrange(EMPLOYMENT_TYPE, LOAN_DEFAULT) %>%
  print(n = Inf)
## # A tibble: 6 × 4
## # Groups:   EMPLOYMENT_TYPE [3]
##   EMPLOYMENT_TYPE LOAN_DEFAULT count  prop
##   <chr>                  <dbl> <int> <dbl>
## 1 Salaried                   0 77948 0.797
## 2 Salaried                   1 19910 0.203
## 3 Self employed              0 98578 0.772
## 4 Self employed              1 29057 0.228
## 5 <NA>                       0  6017 0.785
## 6 <NA>                       1  1644 0.215

‘MOBILENO_AVL_FLAG’

train %>%
  group_by(MOBILENO_AVL_FLAG, LOAN_DEFAULT) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(MOBILENO_AVL_FLAG) %>%
  mutate(percentage = count / sum(count)) %>%
  print()
## # A tibble: 2 × 4
## # Groups:   MOBILENO_AVL_FLAG [1]
##   MOBILENO_AVL_FLAG LOAN_DEFAULT  count percentage
##               <dbl>        <dbl>  <int>      <dbl>
## 1                 1            0 182543      0.783
## 2                 1            1  50611      0.217

‘AADHAR_FLAG’

train %>%
  group_by(AADHAR_FLAG, LOAN_DEFAULT) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(AADHAR_FLAG) %>%
  mutate(percentage = count / sum(count)) %>%
  print()
## # A tibble: 4 × 4
## # Groups:   AADHAR_FLAG [2]
##   AADHAR_FLAG LOAN_DEFAULT  count percentage
##         <dbl>        <dbl>  <int>      <dbl>
## 1           0            0  27684      0.744
## 2           0            1   9546      0.256
## 3           1            0 154859      0.790
## 4           1            1  41065      0.210

*‘PAN_FLAG’

train %>%
  group_by(PAN_FLAG, LOAN_DEFAULT) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(PAN_FLAG) %>%
  mutate(percentage = count / sum(count)) %>%
  print()
## # A tibble: 4 × 4
## # Groups:   PAN_FLAG [2]
##   PAN_FLAG LOAN_DEFAULT  count percentage
##      <dbl>        <dbl>  <int>      <dbl>
## 1        0            0 168799      0.783
## 2        0            1  46734      0.217
## 3        1            0  13744      0.780
## 4        1            1   3877      0.220

‘VOTERID_FLAG’

train %>%
  group_by(VOTERID_FLAG, LOAN_DEFAULT) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(VOTERID_FLAG) %>%
  mutate(percentage = count / sum(count)) %>%
  print()
## # A tibble: 4 × 4
## # Groups:   VOTERID_FLAG [2]
##   VOTERID_FLAG LOAN_DEFAULT  count percentage
##          <dbl>        <dbl>  <int>      <dbl>
## 1            0            0 157565      0.790
## 2            0            1  41795      0.210
## 3            1            0  24978      0.739
## 4            1            1   8816      0.261

‘DRIVING_FLAG’

train %>%
  group_by(DRIVING_FLAG, LOAN_DEFAULT) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(DRIVING_FLAG) %>%
  mutate(percentage = count / sum(count)) %>%
  print()
## # A tibble: 4 × 4
## # Groups:   DRIVING_FLAG [2]
##   DRIVING_FLAG LOAN_DEFAULT  count percentage
##          <dbl>        <dbl>  <int>      <dbl>
## 1            0            0 178216      0.783
## 2            0            1  49519      0.217
## 3            1            0   4327      0.798
## 4            1            1   1092      0.202

‘PASSPORT_FLAG’

train %>%
  group_by(PASSPORT_FLAG, LOAN_DEFAULT) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(PASSPORT_FLAG) %>%
  mutate(percentage = count / sum(count)) %>%
  print()
## # A tibble: 4 × 4
## # Groups:   PASSPORT_FLAG [2]
##   PASSPORT_FLAG LOAN_DEFAULT  count percentage
##           <dbl>        <dbl>  <int>      <dbl>
## 1             0            0 182121      0.783
## 2             0            1  50537      0.217
## 3             1            0    422      0.851
## 4             1            1     74      0.149

Combined

train %>%
  group_by(LOAN_DEFAULT, EMPLOYMENT_TYPE, AADHAR_FLAG, PAN_FLAG, DRIVING_FLAG, PASSPORT_FLAG, VOTERID_FLAG) %>%
  summarise(count = n(), .groups = 'drop') %>%
  arrange(desc(count)) %>%
  print()
## # A tibble: 103 × 8
##    LOAN_DEFAULT EMPLOYMENT_TYPE AADHAR_FLAG PAN_FLAG DRIVING_FLAG PASSPORT_FLAG
##           <dbl> <chr>                 <dbl>    <dbl>        <dbl>         <dbl>
##  1            0 Self employed             1        0            0             0
##  2            0 Salaried                  1        0            0             0
##  3            1 Self employed             1        0            0             0
##  4            1 Salaried                  1        0            0             0
##  5            0 Self employed             0        0            0             0
##  6            0 Salaried                  0        0            0             0
##  7            0 <NA>                      1        0            0             0
##  8            1 Self employed             0        0            0             0
##  9            0 Salaried                  1        1            0             0
## 10            0 Self employed             1        1            0             0
## # ℹ 93 more rows
## # ℹ 2 more variables: VOTERID_FLAG <dbl>, count <int>

We can observe that the distribution between defaulted and non-defaulted client across all analyzed variables fluctuates around 21/79.

Default vs. Disbursal date

train$LOAN_DEFAULT <- as.factor(train$LOAN_DEFAULT)

ggplot(train, aes(x = DISBURSAL_DATE, fill = LOAN_DEFAULT)) +
  geom_histogram(data = subset(train, LOAN_DEFAULT == 1), 
                 bins = 50, alpha = 0.8, fill = "deeppink") +
  geom_histogram(data = subset(train, LOAN_DEFAULT == 0), 
                 bins = 50, alpha = 0.8, fill = "deepskyblue") +
  facet_wrap(~LOAN_DEFAULT, ncol = 1, scales = "free_y",
             labeller = as_labeller(c(`0` = "No default", `1` = "Default"))) +
  labs(x = "DISBURSAL DATE", y = "Number of Loans") +
  theme_bw() +
  theme(legend.position = "none")

Outlier Treatment

First let’s define a couple of useful functions that will help us quickly perform univariate analysis.

# Plot distribution of one feature

plot_distribution <- function(feature, color = "steelblue") {
ggplot(train, aes_string(x = feature)) + geom_histogram(aes(y =
..density..), fill = color, bins = 100, alpha = 0.7, na.rm = TRUE) +
geom_density(color = "black", size = 1, na.rm = TRUE) + labs(title =
paste("Distribution of", feature), x = feature, y = "Density") +
theme_minimal() }
# Plot distribution bar plot of several features
plot_bar_comp <- function(var, nrow = 2) {
  plot_list <- list()
  
  for (feature in var) {
    p <- ggplot(train, aes_string(x = feature)) +
      geom_bar(fill = "skyblue") +
      labs(x = feature, y = "Count plot") +
      theme_minimal(base_size = 12)
    plot_list[[length(plot_list) + 1]] <- p
  }
  
  ncol <- 2
  do.call(grid.arrange, c(plot_list, nrow = nrow, ncol = ncol))
}
# Plot distribution plot of several features
plot_distribution_comp <- function(data, variables, n_rows = 3, n_cols = 2) {
  plot_list <- list()
  
  for (var in variables) {
    # Check if variable has very large values that need scientific notation
    scientific <- max(data[[var]], na.rm = TRUE) > 1e6
    
    p <- ggplot(data, aes_string(x = var, color = "factor(LOAN_DEFAULT)", 
                                 fill = "factor(LOAN_DEFAULT)")) +
      geom_density(alpha = 0.2) +
      scale_color_manual(name = "",
                       values = c("0" = "blue", "1" = "red"),
                       labels = c("LOAN_DEFAULT = 0", "LOAN_DEFAULT = 1")) +
      scale_fill_manual(name = "",
                       values = c("0" = "blue", "1" = "red"),
                       labels = c("LOAN_DEFAULT = 0", "LOAN_DEFAULT = 1")) +
      labs(x = var, y = "Density plot") +
      theme_bw() +
      theme(panel.grid.major = element_line(color = "grey90"),
            panel.grid.minor = element_line(color = "grey90"),
            legend.position = "bottom",
            legend.box = "horizontal",
            plot.title = element_text(size = 11))
    
    # Use scientific notation for large value variables
    if (scientific) {
      p <- p + scale_x_continuous(labels = scientific_format())
    }
    
    plot_list[[length(plot_list) + 1]] <- p
  }
  
  # Arrange all plots in a grid
  grid.arrange(grobs = plot_list, nrow = n_rows, ncol = n_cols)
}
# Box Plot for one feature
plot_box <- function(feature, color = "skyblue") { ggplot(train,
aes_string(y = feature)) + geom_boxplot(fill = color, outlier.color =
"red", na.rm = TRUE) + labs(title = paste("Box Plot of", feature), y =
feature) + theme_minimal() }
# Bar Plot for one feature
plot_bar <- function(feature) { ggplot(train, aes_string(y = feature,
fill = "factor(LOAN_DEFAULT)")) + geom_bar(position = "dodge", color =
"black") + scale_fill_manual(values = c("0" = "skyblue", "1" = "pink"),
name = "LOAN_DEFAULT", labels = c("No Default", "Default")) + labs(title
= paste("Bar Plot of", feature, "by Loan Default"), y = feature, x =
"Count") + theme_minimal() + theme( axis.text.y = element_text(size =
10), plot.title = element_text(size = 14, face = "bold") ) }

Now, let’s analyze important explanatory variable: ‘DISBURSED_AMOUNT’.

summary(train$DISBURSED_AMOUNT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13320   47145   53803   54357   60413  990572
plot_distribution("DISBURSED_AMOUNT", "green")

plot_box("DISBURSED_AMOUNT", "green")

Based on the box plot, we can observe the presence of the outliers. We will try two methods of dealing with outliers: replacing the outliers’ values with a sample mean and binning. But first, let’s define functions that will speed up the outlier detection process. We will create a function that will calculate all necessary statistical values: mean, lower and upper threshold.

outlier_data <- function(df, feature) {
  # Number of observations
  obs <- length(df[[feature]])
  cat("No. of observations in column:", obs, "\n")
  
  # Descriptive statistics
  data_mean <- mean(df[[feature]], na.rm = TRUE)
  data_sd   <- sd(df[[feature]], na.rm = TRUE)
  cat(sprintf("Statistics: Mean = %.3f, Std dev = %.3f\n", data_mean, data_sd))
  
  # Thresholds for outliers, set as 3 standard deviations
  cut_off <- data_sd * 3
  lower <- data_mean - cut_off
  upper <- data_mean + cut_off
  
  # Outliers count
  outliers <- df[[feature]][df[[feature]] < lower | df[[feature]] > upper]
  cat("Identified outliers:", length(outliers), "\n")
  
  return(list(lower = lower, upper = upper, mean = data_mean))
}

Now, let’s play with smoothing the outliers. We will create the function that will impute the outstanding obseravtions.

impute_outlier <- function(vec, lower, upper, mean_val) {
  sapply(vec, function(x) {
    if (is.na(x)) {
      return(NA)
    } else if (x <= lower || x >= upper) {
      return(mean_val)
    } else {
      return(x)
    }
  })
}
disbursed_amount_stats <- outlier_data(train, 'DISBURSED_AMOUNT')
## No. of observations in column: 233154 
## Statistics: Mean = 54356.994, Std dev = 12971.314
## Identified outliers: 3076

We manage to identify more that 3000 outliers. Now we will replace the values with the mean.

train$DISBURSED_AMOUNT_new <- impute_outlier(train$DISBURSED_AMOUNT, disbursed_amount_stats$lower, disbursed_amount_stats$upper, disbursed_amount_stats$mean)

# No. of observations after the imputation
cat("No. of observations in column: ", length(train$DISBURSED_AMOUNT_new), "\n")
## No. of observations in column:  233154

Now let’s try binning. Firstly, we have too divide our variable’s range into bins. We chose to divide it to 4 bins based on quantiles.

bin_labels <- c("Low", "Medium", "High", "Extreme")

quantiles <- quantile(train$DISBURSED_AMOUNT, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

# Quantile distribution
train$DISBURSED_AMOUNT_bins <- cut(train$DISBURSED_AMOUNT,
                                   breaks = quantiles,
                                   include.lowest = TRUE,
                                   labels = bin_labels)

table(train$DISBURSED_AMOUNT_bins)
## 
##     Low  Medium    High Extreme 
##   58537   58676   57734   58207
plot_bar("DISBURSED_AMOUNT_bins")

Now let’s analyze another variable: ‘ASSET_COST’

summary(train$ASSET_COST)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37000   65717   70946   75865   79202 1628992
plot_distribution <- function(feature, color = "steelblue") {
  ggplot(train, aes_string(x = feature)) +
    geom_histogram(aes(y = ..density..), fill = color, bins = 100, alpha = 0.7, na.rm = TRUE) +
    geom_density(color = "black", size = 1, na.rm = TRUE) +
    labs(title = paste("Distribution of", feature), x = feature, y = "Density") +
    theme_minimal()
}
plot_distribution("ASSET_COST", "tomato")

plot_box("ASSET_COST", "tomato")

Again, we can notice the presence of the outliers. We can apply similar approach as before.

Imputation

asset_cost_stats <- outlier_data(train, 'ASSET_COST')
## No. of observations in column: 233154 
## Statistics: Mean = 75865.068, Std dev = 18944.781
## Identified outliers: 4425

Almost 4500 outliers detected - let’s try smoothing them out with mean

train$ASSET_COST_new <- impute_outlier(train$ASSET_COST, asset_cost_stats$lower, asset_cost_stats$upper, asset_cost_stats$mean)

# No. of observations after the imputation

cat("No. of observations in column: ", length(train$ASSET_COST_new), "\n")
## No. of observations in column:  233154

Binning

quantiles <- quantile(train$ASSET_COST, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

#Quantile distribution
train$ASSET_COST_bins <- cut(train$ASSET_COST,
                                   breaks = quantiles,
                                   include.lowest = TRUE,
                                   labels = bin_labels)

table(train$ASSET_COST_bins)
## 
##     Low  Medium    High Extreme 
##   58290   58288   58287   58289
plot_bar("ASSET_COST_bins")

‘LTV’

summary(train$LTV)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.03   68.88   76.80   74.75   83.67   95.00
plot_distribution <- function(feature, color = "steelblue") {
  ggplot(train, aes_string(x = feature)) +
    geom_histogram(aes(y = ..density..), fill = color, bins = 100, alpha = 0.7, na.rm = TRUE) +
    geom_density(color = "black", size = 1, na.rm = TRUE) +
    labs(title = paste("Distribution of", feature), x = feature, y = "Density") +
    theme_minimal()
}
plot_distribution("LTV", "tomato")

plot_box("LTV", "tomato")

Imputation

LTV_stats <- outlier_data(train, 'LTV')
## No. of observations in column: 233154 
## Statistics: Mean = 74.747, Std dev = 11.457
## Identified outliers: 2745
train$LTV_new <- impute_outlier(train$LTV, LTV_stats$lower, LTV_stats$upper, LTV_stats$mean)

# No. of observations after the imputation

cat("No. of observations in column: ", length(train$LTV_new), "\n")
## No. of observations in column:  233154

Binning

quantiles <- quantile(train$LTV, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

#Quantile distribution
train$LTV_bins <- cut(train$LTV,
                                   breaks = quantiles,
                                   include.lowest = TRUE,
                                   labels = bin_labels)

table(train$LTV_bins)
## 
##     Low  Medium    High Extreme 
##   58303   58299   58285   58267
plot_bar("LTV_bins")

‘PERFORM_CNS_SCORE’

summary(train$PERFORM_CNS_SCORE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0   289.5   678.0   890.0
plot_distribution <- function(feature, color = "steelblue") {
  ggplot(train, aes_string(x = feature)) +
    geom_histogram(aes(y = ..density..), fill = color, bins = 100, alpha = 0.7, na.rm = TRUE) +
    geom_density(color = "black", size = 1, na.rm = TRUE) +
    labs(title = paste("Distribution of", feature), x = feature, y = "Density") +
    theme_minimal()
}
plot_distribution("PERFORM_CNS_SCORE", "tomato")

plot_box("PERFORM_CNS_SCORE", "tomato")

Imputation

perform_cns_score_stats <- outlier_data(train, 'PERFORM_CNS_SCORE')
## No. of observations in column: 233154 
## Statistics: Mean = 289.463, Std dev = 338.375
## Identified outliers: 0
train$PERFORM_CNS_SCORE_new <- impute_outlier(train$PERFORM_CNS_SCORE, perform_cns_score_stats$lower, perform_cns_score_stats$upper, perform_cns_score_stats$mean)

# No. of observations after the imputation

cat("No. of observations in column: ", length(train$PERFORM_CNS_SCORE_new), "\n")
## No. of observations in column:  233154

Binning

bin_labels = c("No History",'Very Low', "Low" ,'Medium', 'High')
cut_bins = c(-1,10,150, 350, 650, 1000)

quantiles <- quantile(train$PERFORM_CNS_SCORE, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

#Quantile distribution
train$PERFORM_CNS_SCORE_bins <- cut(train$PERFORM_CNS_SCORE,
                                   breaks = cut_bins,
                                   include.lowest = TRUE,
                                   labels = bin_labels)

table(train$PERFORM_CNS_SCORE_bins)
## 
## No History   Very Low        Low     Medium       High 
##     116950      12835       9910      28425      65034
plot_bar("PERFORM_CNS_SCORE_bins")

‘PERFORM_CNS_SCORE_DESCRIPTION’

train %>%
  group_by(PERFORM_CNS_SCORE_DESCRIPTION, PERFORM_CNS_SCORE_bins) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(count))
## # A tibble: 20 × 3
##    PERFORM_CNS_SCORE_DESCRIPTION                   PERFORM_CNS_SCORE_bins  count
##    <chr>                                           <fct>                   <int>
##  1 No Bureau History Available                     No History             116950
##  2 C-Very Low Risk                                 High                    16045
##  3 A-Very Low Risk                                 High                    14124
##  4 D-Very Low Risk                                 High                    11358
##  5 B-Very Low Risk                                 High                     9201
##  6 M-Very High Risk                                Low                      8776
##  7 F-Low Risk                                      High                     8485
##  8 K-High Risk                                     Medium                   8277
##  9 H-Medium Risk                                   Medium                   6855
## 10 E-Low Risk                                      High                     5821
## 11 I-Medium Risk                                   Medium                   5557
## 12 G-Low Risk                                      Medium                   3988
## 13 Not Scored: Sufficient History Not Available    Very Low                 3765
## 14 J-High Risk                                     Medium                   3748
## 15 Not Scored: Not Enough Info available on the c… Very Low                 3672
## 16 Not Scored: No Activity seen on the customer (… Very Low                 2885
## 17 Not Scored: No Updates available in last 36 mo… Very Low                 1534
## 18 L-Very High Risk                                Low                      1134
## 19 Not Scored: Only a Guarantor                    Very Low                  976
## 20 Not Scored: More than 50 active Accounts found  Very Low                    3
table(train$PERFORM_CNS_SCORE_DESCRIPTION)
## 
##                                         A-Very Low Risk 
##                                                   14124 
##                                         B-Very Low Risk 
##                                                    9201 
##                                         C-Very Low Risk 
##                                                   16045 
##                                         D-Very Low Risk 
##                                                   11358 
##                                              E-Low Risk 
##                                                    5821 
##                                              F-Low Risk 
##                                                    8485 
##                                              G-Low Risk 
##                                                    3988 
##                                           H-Medium Risk 
##                                                    6855 
##                                           I-Medium Risk 
##                                                    5557 
##                                             J-High Risk 
##                                                    3748 
##                                             K-High Risk 
##                                                    8277 
##                                        L-Very High Risk 
##                                                    1134 
##                                        M-Very High Risk 
##                                                    8776 
##                             No Bureau History Available 
##                                                  116950 
##          Not Scored: More than 50 active Accounts found 
##                                                       3 
## Not Scored: No Activity seen on the customer (Inactive) 
##                                                    2885 
##      Not Scored: No Updates available in last 36 months 
##                                                    1534 
##   Not Scored: Not Enough Info available on the customer 
##                                                    3672 
##                            Not Scored: Only a Guarantor 
##                                                     976 
##            Not Scored: Sufficient History Not Available 
##                                                    3765
gg <- train %>%
  group_by(PERFORM_CNS_SCORE_DESCRIPTION, LOAN_DEFAULT) %>%
  summarise(counts = n(), .groups = "drop") %>%
  group_by(PERFORM_CNS_SCORE_DESCRIPTION) %>%
  mutate(percentage = counts / sum(counts) * 100) %>%
  ungroup()

print(gg)
## # A tibble: 39 × 4
##    PERFORM_CNS_SCORE_DESCRIPTION LOAN_DEFAULT counts percentage
##    <chr>                         <fct>         <int>      <dbl>
##  1 A-Very Low Risk               0             11783       83.4
##  2 A-Very Low Risk               1              2341       16.6
##  3 B-Very Low Risk               0              7993       86.9
##  4 B-Very Low Risk               1              1208       13.1
##  5 C-Very Low Risk               0             13275       82.7
##  6 C-Very Low Risk               1              2770       17.3
##  7 D-Very Low Risk               0              9659       85.0
##  8 D-Very Low Risk               1              1699       15.0
##  9 E-Low Risk                    0              4821       82.8
## 10 E-Low Risk                    1              1000       17.2
## # ℹ 29 more rows

‘PRI_NO_OF_ACCTS’

summary(train$PRI_NO_OF_ACCTS)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.441   3.000 453.000
plot_distribution <- function(feature, color = "steelblue") {
  ggplot(train, aes_string(x = feature)) +
    geom_histogram(aes(y = ..density..), fill = color, bins = 100, alpha = 0.7, na.rm = TRUE) +
    geom_density(color = "black", size = 1, na.rm = TRUE) +
    labs(title = paste("Distribution of", feature), x = feature, y = "Density") +
    theme_minimal()
}
plot_distribution("PRI_NO_OF_ACCTS", "tomato")

plot_box("PRI_NO_OF_ACCTS", "tomato")

Imputation

pri_no_of_accts_stats <- outlier_data(train, 'PRI_NO_OF_ACCTS')
## No. of observations in column: 233154 
## Statistics: Mean = 2.441, Std dev = 5.217
## Identified outliers: 4119
train$PRI_NO_OF_ACCTS_new <- impute_outlier(train$PRI_NO_OF_ACCTS, pri_no_of_accts_stats$lower, pri_no_of_accts_stats$upper, pri_no_of_accts_stats$mean)

# No. of observations after the imputation

cat("No. of observations in column: ", length(train$PRI_NO_OF_ACCTS_new), "\n")
## No. of observations in column:  233154

Binning

bin_labels <- c("One", "More than One")
cut_bins <- c(-1, 1, 1000)

train$PRI_NO_OF_ACCTS_bins <- cut(train$PRI_NO_OF_ACCTS,
                                   breaks = cut_bins,
                                   include.lowest = TRUE,
                                   labels = bin_labels)

table(train$PRI_NO_OF_ACCTS_bins)
## 
##           One More than One 
##        151928         81226
plot_bar("PRI_NO_OF_ACCTS_bins")

‘PRI_OVERDUE_ACCTS’

summary(train$PRI_OVERDUE_ACCTS)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1565  0.0000 25.0000
plot_distribution <- function(feature, color = "steelblue") {
  ggplot(train, aes_string(x = feature)) +
    geom_histogram(aes(y = ..density..), fill = color, bins = 100, alpha = 0.7, na.rm = TRUE) +
    geom_density(color = "black", size = 1, na.rm = TRUE) +
    labs(title = paste("Distribution of", feature), x = feature, y = "Density") +
    theme_minimal()
}
plot_distribution("PRI_OVERDUE_ACCTS", "tomato")

plot_box("PRI_OVERDUE_ACCTS", "tomato")

Imputation

pri_overdue_accts_stats <- outlier_data(train, 'PRI_OVERDUE_ACCTS')
## No. of observations in column: 233154 
## Statistics: Mean = 0.157, Std dev = 0.549
## Identified outliers: 6305
train$PRI_OVERDUE_ACCTS_new <- impute_outlier(train$PRI_OVERDUE_ACCTS, pri_overdue_accts_stats$lower, pri_overdue_accts_stats$upper, pri_overdue_accts_stats$mean)

# No. of observations after the imputation

cat("No. of observations in column: ", length(train$PRI_OVERDUE_ACCTS_new), "\n")
## No. of observations in column:  233154

Binning

bin_labels <- c("None", "One (or more)")
cut_bins <- c(-1, 0, 1000)

train$PRI_OVERDUE_ACCTS_bins <- cut(train$PRI_OVERDUE_ACCTS,
                                   breaks = cut_bins,
                                   include.lowest = TRUE,
                                   labels = bin_labels)

table(train$PPRI_OVERDUE_ACCTS_bins)
## < table of extent 0 >
plot_bar("PRI_OVERDUE_ACCTS_bins")

Let’s look into data with lesser importance

var <- c("MOBILENO_AVL_FLAG", "AADHAR_FLAG", "PAN_FLAG", "VOTERID_FLAG", "PASSPORT_FLAG", "DRIVING_FLAG")
plot_bar_comp(var, nrow = 3)

Univariate Analysis

‘EMPLOYMENT_TYPE’

ggplot(train, aes(x = EMPLOYMENT_TYPE, fill = factor(LOAN_DEFAULT))) +
geom_bar(position = "dodge", color = "black") + labs(title =
"EMPLOYMENT_TYPE vs LOAN_DEFAULT", x = "Employment Type", y = "Count",
fill = "Loan Default") + theme_minimal() + scale_fill_manual(values =
c("0" = "skyblue", "1" = "tomato"))

Age is in days

now <- Sys.Date()
train$DATE_OF_BIRTH <- as.Date(train$DATE_OF_BIRTH, format = "%d-%m-%Y")

# Calculate age in days
train$age <- as.numeric(difftime(now, train$DATE_OF_BIRTH, units = "days"))

# Print first few ages
print(head(train$age))
## [1] 15128 14551 14527 11477 17342 12686
train$disbursal_time <- as.numeric(difftime(now, train$DISBURSAL_DATE, units = "days"))
head(train$disbursal_time)
## [1] 2495 2441 2497 2411 2441 2448

‘MANUFACTURER_ID’

ggplot(train, aes(x = factor(MANUFACTURER_ID), fill = factor(LOAN_DEFAULT))) +
  geom_bar(position = "dodge", color = "black") +
  labs(title = "MANUFACTURER_ID vs LOAN_DEFAULT",
       x = "MANUFACTURER_ID",
       y = "Count",
       fill = "Loan Default") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "tomato"),
                    labels = c("No Default", "Default")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

‘BRANCH_ID’

ggplot(train, aes(x = factor(BRANCH_ID), fill = factor(LOAN_DEFAULT))) +
  geom_bar(position = "dodge", color = "black") +
  labs(title = "BRANCH_ID vs LOAN_DEFAULT",
       x = "BRANCH_ID",
       y = "Count",
       fill = "Loan Default") +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "tomato"),
                    labels = c("No Default", "Default")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

var <- c("PRI_NO_OF_ACCTS_new", "PRI_ACTIVE_ACCTS", "PRI_OVERDUE_ACCTS_new",
         "PRI_CURRENT_BALANCE", "PRI_SANCTIONED_AMOUNT", "PRI_DISBURSED_AMOUNT")

plot_distribution_comp(train, var)

Let’s see the new columns along with the less important continous variables

var <- c("SEC_NO_OF_ACCTS", "SEC_ACTIVE_ACCTS", "SEC_OVERDUE_ACCTS",
"SEC_CURRENT_BALANCE", "SEC_SANCTIONED_AMOUNT", "SEC_DISBURSED_AMOUNT")

plot_distribution_comp(train, var)

Feature Selection

Now we will proceed with the feature selection. First, we will drop the variables that won’t be used in our model.

train <- train %>%
  select(-DATE_OF_BIRTH, -STATE_ID, -EMPLOYEE_CODE_ID,
         -SUPPLIER_ID, -MANUFACTURER_ID, -CURRENT_PINCODE_ID, -BRANCH_ID)

Now, let’s calucate the correlation matrix to see which features are correlated with each other in order to decrease a number of used variables.

corr_cols <- c("PRI_ACTIVE_ACCTS", "PRI_CURRENT_BALANCE",
"PRI_SANCTIONED_AMOUNT", "PRI_DISBURSED_AMOUNT", "SEC_NO_OF_ACCTS",
"SEC_ACTIVE_ACCTS", "SEC_OVERDUE_ACCTS", "SEC_CURRENT_BALANCE",
"SEC_SANCTIONED_AMOUNT", "SEC_DISBURSED_AMOUNT", "PRI_NO_OF_ACCTS_new",
"PRI_OVERDUE_ACCTS_new")

corr_data <- train[, corr_cols]

corr_matrix <- cor(corr_data, use = "pairwise.complete.obs")

corr_melted <- melt(corr_matrix)

# Create interactive heatmap
plot_ly(data = corr_melted,
        x = ~Var1,
        y = ~Var2,
        z = ~value,
        zmin = -1,              
        zmax = 1,
        type = "heatmap",
        colors = colorRamp(c("steelblue", "white", "darkgreen")))

Not highly correlated with anyone: ‘PRI_ACTIVE_ACCTS’, ‘PRI_CURRENT_BALANCE’,‘PRI_SANCTIONED_AMOUNT’, ‘PRI_DISBURSED_AMOUNT’,‘SEC_OVERDUE_ACCTS’.

‘PRI_NO_OF_ACCTS_new’,‘PRI_OVERDUE_ACCTS_new’ are perfectly positively correlated and hence we are keeping only one.

‘SEC_NO_OF_ACCTS’, ‘SEC_ACTIVE_ACCTS’ are highly positively correlated, hence we are keeping only one.

‘SEC_CURRENT_BALANCE’, ‘SEC_SANCTIONED_AMOUNT’, ‘SEC_DISBURSED_AMOUNT’ are highly positively correlated, hence we are keeping only one.

train <- train %>%
  select(-PRI_OVERDUE_ACCTS_new, -SEC_ACTIVE_ACCTS, -SEC_SANCTIONED_AMOUNT, -SEC_DISBURSED_AMOUNT)

Now let’s analyze other variables.

corr_data <- train[, c('SEC_INSTAL_AMT',
'PERFORM_CNS_SCORE','NEW_ACCTS_IN_LAST_SIX_MONTHS',
'DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS', 'AVERAGE_ACCT_AGE',
'CREDIT_HISTORY_LENGTH', 'NO_OF_INQUIRIES','age', 'disbursal_time')]

corr_matrix <- cor(corr_data, use = "pairwise.complete.obs")

corr_melted <- melt(corr_matrix)

# Create interactive heatmap
plot_ly(data = corr_melted,
        x = ~Var1,
        y = ~Var2,
        z = ~value,
        zmin = -1,            
        zmax = 1,
        type = "heatmap",
        colors = colorRamp(c("steelblue", "white", "darkgreen")))

‘AVERAGE_ACCT_AGE’, ‘CREDIT_HISTORY_LENGTH’ are highly positively correlated and hence we are keeping only one.

train <- train %>%
  select(-AVERAGE_ACCT_AGE)
corr_data <- train[, c('PRI_ACTIVE_ACCTS', 'PRI_CURRENT_BALANCE',
'PRI_SANCTIONED_AMOUNT', 'PERFORM_CNS_SCORE', 'PRI_DISBURSED_AMOUNT',
'SEC_NO_OF_ACCTS', 'SEC_OVERDUE_ACCTS', 'SEC_CURRENT_BALANCE',
'PRIMARY_INSTAL_AMT', 'SEC_INSTAL_AMT', 'NEW_ACCTS_IN_LAST_SIX_MONTHS',
'DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS', 'CREDIT_HISTORY_LENGTH',
'NO_OF_INQUIRIES', 'DISBURSED_AMOUNT_new', 'ASSET_COST_new', 'LTV_new',
'PRI_NO_OF_ACCTS_new', 'age', 'disbursal_time')]

corr_matrix <- cor(corr_data, use = "pairwise.complete.obs")

corr_melted <- melt(corr_matrix)

# Create interactive heatmap
plot_ly(data = corr_melted,
        x = ~Var1,
        y = ~Var2,
        z = ~value,
        zmin = -1,            
        zmax = 1,
        type = "heatmap",
        colors = colorRamp(c("steelblue", "white", "darkgreen")))

Based on the correlation matrix we decided to: Choose one out of ‘PRI_SANCTIONED_AMOUNT’, ‘PRI_DISBURSED_AMOUNT’ Choose one out of ‘LTV_new’, ‘PRI_NO_OF_ACCTS_new’ And eliminate ’NEW_ACCTS_IN_LAST_SIX_MONTHS

train <- train %>%
  select(-PRI_SANCTIONED_AMOUNT,-PRI_NO_OF_ACCTS_new,-NEW_ACCTS_IN_LAST_SIX_MONTHS)

Now we will prepare our data sets. One will contain only continuous variables, with the outliers treated with imputation. Other will also contain binned variables.

Continuous Variables

train_con <- train[, c('EMPLOYMENT_TYPE', 'MOBILENO_AVL_FLAG',
'AADHAR_FLAG', 'PAN_FLAG', 'VOTERID_FLAG', 'DRIVING_FLAG',
'PASSPORT_FLAG', 'PERFORM_CNS_SCORE', 'PERFORM_CNS_SCORE_DESCRIPTION',
'PRI_ACTIVE_ACCTS', 'PRI_CURRENT_BALANCE', 'PRI_DISBURSED_AMOUNT',
'SEC_NO_OF_ACCTS', 'SEC_OVERDUE_ACCTS', 'SEC_CURRENT_BALANCE',
'PRIMARY_INSTAL_AMT', 'SEC_INSTAL_AMT',
'DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS', 'CREDIT_HISTORY_LENGTH',
'NO_OF_INQUIRIES', 'LOAN_DEFAULT', 'DISBURSED_AMOUNT_new',
'ASSET_COST_new', 'LTV_new', 'age', 'disbursal_time')]

Binned Variables

train_bin <- train[, c('UNIQUEID', 'EMPLOYMENT_TYPE',
'MOBILENO_AVL_FLAG', 'AADHAR_FLAG', 'PAN_FLAG', 'VOTERID_FLAG',
'DRIVING_FLAG', 'PASSPORT_FLAG', 'PERFORM_CNS_SCORE',
'PERFORM_CNS_SCORE_DESCRIPTION', 'PRI_ACTIVE_ACCTS',
'PRI_CURRENT_BALANCE', 'PRI_DISBURSED_AMOUNT', 'SEC_NO_OF_ACCTS',
'SEC_OVERDUE_ACCTS', 'SEC_CURRENT_BALANCE', 'PRIMARY_INSTAL_AMT',
'SEC_INSTAL_AMT', 'DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS',
'CREDIT_HISTORY_LENGTH', 'NO_OF_INQUIRIES', 'LOAN_DEFAULT',
'DISBURSED_AMOUNT_bins', 'ASSET_COST_bins', 'LTV_bins',
'PERFORM_CNS_SCORE_bins', 'PRI_NO_OF_ACCTS_bins',
'PRI_OVERDUE_ACCTS_bins', 'age', 'disbursal_time')]

Standardization of data

Let’s transform some of the variables for better modelling.

scaleColumns <- function(df, cols_to_scale) { 
  for (col in cols_to_scale) { 
    df[[col]] <- scale(df[[col]]) 
  } 
  return(df) 
  }
scaled_df <- scaleColumns(train_con, c('PERFORM_CNS_SCORE',
'PRI_ACTIVE_ACCTS', 'PRI_CURRENT_BALANCE', 'PRI_DISBURSED_AMOUNT',
'SEC_NO_OF_ACCTS', 'SEC_OVERDUE_ACCTS', 'SEC_CURRENT_BALANCE',
'PRIMARY_INSTAL_AMT', 'SEC_INSTAL_AMT',
'DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS', 'CREDIT_HISTORY_LENGTH',
'NO_OF_INQUIRIES', 'DISBURSED_AMOUNT_new', 'ASSET_COST_new', 'LTV_new',
'age', 'disbursal_time')
)

head(scaled_df)
## # A tibble: 6 × 26
##   EMPLOYMENT_TYPE MOBILENO_AVL_FLAG AADHAR_FLAG PAN_FLAG VOTERID_FLAG
##   <chr>                       <dbl>       <dbl>    <dbl>        <dbl>
## 1 Salaried                        1           1        0            0
## 2 Self employed                   1           1        0            0
## 3 Self employed                   1           1        0            0
## 4 Self employed                   1           1        0            0
## 5 Self employed                   1           1        0            0
## 6 Self employed                   1           1        0            0
## # ℹ 21 more variables: DRIVING_FLAG <dbl>, PASSPORT_FLAG <dbl>,
## #   PERFORM_CNS_SCORE <dbl[,1]>, PERFORM_CNS_SCORE_DESCRIPTION <chr>,
## #   PRI_ACTIVE_ACCTS <dbl[,1]>, PRI_CURRENT_BALANCE <dbl[,1]>,
## #   PRI_DISBURSED_AMOUNT <dbl[,1]>, SEC_NO_OF_ACCTS <dbl[,1]>,
## #   SEC_OVERDUE_ACCTS <dbl[,1]>, SEC_CURRENT_BALANCE <dbl[,1]>,
## #   PRIMARY_INSTAL_AMT <dbl[,1]>, SEC_INSTAL_AMT <dbl[,1]>,
## #   DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS <dbl[,1]>, …

Dummy insertion

Let’s turn the categorical variables into the dummy variables.

train_dummy <- fastDummies::dummy_cols( scaled_df, remove_first_dummy = TRUE, remove_selected_columns = TRUE, ignore_na = TRUE)

head(train_dummy)
## # A tibble: 6 × 44
##   MOBILENO_AVL_FLAG AADHAR_FLAG PAN_FLAG VOTERID_FLAG DRIVING_FLAG PASSPORT_FLAG
##               <dbl>       <dbl>    <dbl>        <dbl>        <dbl>         <dbl>
## 1                 1           1        0            0            0             0
## 2                 1           1        0            0            0             0
## 3                 1           1        0            0            0             0
## 4                 1           1        0            0            0             0
## 5                 1           1        0            0            0             0
## 6                 1           1        0            0            0             0
## # ℹ 38 more variables: PERFORM_CNS_SCORE <dbl>, PRI_ACTIVE_ACCTS <dbl>,
## #   PRI_CURRENT_BALANCE <dbl>, PRI_DISBURSED_AMOUNT <dbl>,
## #   SEC_NO_OF_ACCTS <dbl>, SEC_OVERDUE_ACCTS <dbl>, SEC_CURRENT_BALANCE <dbl>,
## #   PRIMARY_INSTAL_AMT <dbl>, SEC_INSTAL_AMT <dbl>,
## #   DELINQUENT_ACCTS_IN_LAST_SIX_MONTHS <dbl>, CREDIT_HISTORY_LENGTH <dbl>,
## #   NO_OF_INQUIRIES <dbl>, DISBURSED_AMOUNT_new <dbl>, ASSET_COST_new <dbl>,
## #   LTV_new <dbl>, age <dbl>, disbursal_time <dbl>, …

Finally, let’s divide our modified dataset into a train and test set.

y <- train_dummy["LOAN_DEFAULT_1"]

X <- train_dummy[, setdiff(names(train_dummy), "LOAN_DEFAULT_1")]

dim(y)
## [1] 233154      1
set.seed(101) 
train_indices <- createDataPartition(y$LOAN_DEFAULT_1, p = 0.80, list = FALSE)

# Create training and testing datasets
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices, ]
y_test <- y[-train_indices, ]

k_fold <- trainControl(method = "cv", number = 10)
X_train$`EMPLOYMENT_TYPE_Self employed`[is.na(X_train$`EMPLOYMENT_TYPE_Self employed`)] <- 0
X_test$`EMPLOYMENT_TYPE_Self employed`[is.na(X_test$`EMPLOYMENT_TYPE_Self employed`)] <- 0

Modeling

Now we can proceed to modeling. Firstly, we will define a couple of functions that will help us in the modeling process.

# Confusion Matrix
plot_confusion_matrix <- function(cm, classes, normalize = FALSE, title = "Confusion Matrix") {
  cm_df <- as.data.frame(cm) 
  colnames(cm_df) <- c("True", "Predicted","Freq")
  
  if (normalize) { 
    cm_df <- cm_df %>% 
      group_by(True) %>% 
      mutate(Freq = Freq / sum(Freq)) %>% 
      ungroup() 
    }
  max_val <- max(cm_df$Freq) 
  cm_df$text_color <- ifelse(cm_df$Freq > max_val / 2, "white", "black")
  
  decimals <- if(normalize) 2 else 0
  
  ggplot(cm_df, aes(x = Predicted, y = True, fill = Freq)) +
    geom_tile(color = "white") + 
    geom_text(aes(label = round(Freq, decimals), color = text_color), size = 4) +
    scale_fill_gradient(low = "white", high = "blue") +
    scale_color_identity() + 
    labs(title = title, x = "Predicted label", y = "True label") + 
    theme_minimal() 
}
# Precision, Recall, F1 Score

show_metrics <- function(cm) { 
  cm <- as.matrix(cm)
  
  tn <- cm[1, 1] 
  fp <- cm[1, 2] 
  fn <- cm[2, 1] 
  tp <- cm[2, 2]
  
  precision <- tp / (tp + fp) 
  recall <- tp / (tp + fn) 
  f1_score <- 2 * ((precision * recall) / (precision + recall))
  
  cat(sprintf("Precision = %.3f\n", precision)) 
  cat(sprintf("Recall = %.3f\n", recall)) 
  cat(sprintf("F1_score = %.3f\n", f1_score)) 
}
# Precision-recall curve

plot_precision_recall <- function(recall, precision) {
  df <- data.frame(recall = recall, precision = precision)
  
  ggplot(df, aes(x = recall, y = precision)) +
    geom_step(direction = "hv", alpha = 0.5, color = "blue") + 
    geom_ribbon(aes(ymin = 0, ymax = precision), alpha = 0.2, fill = "blue") +
    xlim(0.0, 1.0) + ylim(0.0, 1.05) +
    labs(title = "Precision-Recall Curve", x = "Recall", y = "Precision") +
    theme_minimal()
}
# ROC curve

plot_roc <- function(fpr, tpr) { 
  df <- data.frame(fpr = fpr, tpr = tpr)
  
  ggplot(df, aes(x = fpr, y = tpr)) +
    geom_line(color = "blue", linewidth = 1.2) + 
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", color= "black", linewidth = 1) + 
    xlim(0.0, 1.0) + ylim(0.0, 1.05) + 
    labs(title = "ROC Curve", x = "False Positive Rate", y = "True Positive Rate") +
    theme_minimal()
}
#feature importance plot

plot_feature_importance <- function(model, predictors) { 
  tmp <- data.frame( Feature = predictors, Feature_importance = model$importance)
  tmp <- tmp[order(tmp$Feature_importance, decreasing = TRUE), ]
  
  ggplot(tmp, aes(x = reorder(Feature, Feature_importance), y =Feature_importance)) + 
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() + labs(title = "Features importance", x = "Feature", y ="Feature importance") + 
    theme_minimal() + theme(plot.title = element_text(size = 14)) 
  }

print(class(y_train))

Logistic Regression

print(class(y_train))
## [1] "tbl_df"     "tbl"        "data.frame"
if(is.list(y_train) && length(y_train) == 1) {
  y_train <- unlist(y_train)
}
logmodel <- glm(y_train ~ ., data = X_train, family = "binomial")

log_probs <- predict(logmodel, newdata = X_test, type = "response")
# For class predictions (equivalent to Python's predict):
logpred <- ifelse(log_probs > 0.5, 1, 0)
# Print confusion matrix
if(is.data.frame(logpred)) {
  logpred <- as.vector(unlist(logpred))
}
if(is.list(y_test) && length(y_test) == 1) {
  y_test <- unlist(y_test)
}
# Cross-validation
# Define k-fold cross-validation control
ctrl <- trainControl(method = "cv", number = 10)  # Assuming k_fold = 5 in your Python code

# Perform cross-validation
cv_model <- train(
  x = as.matrix(X_train),
  y = factor(y_train),
  method = "glm",
  family = "binomial",
  trControl = ctrl,
  metric = "Accuracy"
)

# Print cross-validated accuracy
LOGCV <- cv_model$results$Accuracy
print(paste0("Cross-validated accuracy: ", round(LOGCV * 100, 2), "%"))
## [1] "Cross-validated accuracy: 78.3%"
cm_log <- conf_matrix <- confusionMatrix(factor(logpred, levels = c(0,1)), factor(y_test, levels = c(0,1)))
saveRDS(cm_log, "cm_log.rds")

# Extract and print the metrics 
# Accuracy
cat("Accuracy of model", cm_log$overall["Accuracy"], "\n")
## Accuracy of model 0.7815355
# F1 Score
cat("F1 Score", cm_log$byClass["F1"], "\n")
## F1 Score 0.8772162
# Recall Score
cat("Recall Score", cm_log$byClass["Sensitivity"], "\n")
## Recall Score 0.9986553
# Balanced Accuracy Score 
cat("Balanced Accuracy Score", cm_log$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.501928

Random Forest

set.seed(42)  

if(!is.factor(y_train)) {
  y_train_factor <- factor(y_train)
} else {
  y_train_factor <- y_train
}

# Train the Random Forest model
rfc <- randomForest(
  x = X_train,             
  y = y_train_factor,      
  ntree = 10,               
  importance = TRUE         
)

# Predict on test set
rfc_pred <- predict(rfc, X_test)
# Ensure factors have same levels for confusion matrix
y_test_factor <- factor(y_test)
rfc_pred_factor <- factor(rfc_pred, levels = levels(y_test_factor))

# Create confusion matrix and print it
cm_rf <- confusionMatrix(rfc_pred_factor, factor(y_test))
print(cm_rf$table)
##           Reference
## Prediction     0     1
##          0 35792  9807
##          1   647   384
saveRDS(cm_rf, "cm_rf.rds")

# Print results (equivalent to the Python code)
cat("Accuracy of model", cm_rf$overall["Accuracy"], "\n")
## Accuracy of model 0.7758096
cat("F1 Score", cm_rf$byClass["F1"], "\n")
## F1 Score 0.8725712
cat("Recall Score", cm_rf$byClass["Sensitivity"], "\n")
## Recall Score 0.9822443
cat("Balanced Accuracy Score", cm_rf$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5099623

Naive Bayes

nb_model <- naiveBayes(x = X_train, y = factor(y_train))

# Predict on test set

nb_pred <- predict(nb_model, newdata = X_test, type = "class")

# Ensure predictions and actual values are factors with consistent levels for confusion matrix
y_test_factor <- factor(y_test, levels = levels(nb_pred))

# Create confusion matrix and print it
cm_nb <- confusionMatrix(nb_pred, y_test_factor)
saveRDS(cm_nb, "cm_nb.rds")
print(cm_nb$table)
##           Reference
## Prediction     0     1
##          0 11838  2248
##          1 24601  7943
# Print accuracy as percentage
cat("Accuracy of model (Naive Bayes): ", round(cm_nb$overall["Accuracy"] * 100, 2), "%\n")
## Accuracy of model (Naive Bayes):  42.42 %
cat("F1 Score: ", cm_nb$byClass["F1"], "\n")
## F1 Score:  0.4685997
cat("Recall Score (Sensitivity): ", cm_nb$byClass["Sensitivity"], "\n")
## Recall Score (Sensitivity):  0.3248717
cat("Balanced Accuracy Score: ", cm_nb$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score:  0.5521425

Stochastic Gradient Descent

set.seed(101)

# Preprocess y_train
if (is.list(y_train) || is.data.frame(y_train)) {
  y_train <- unlist(y_train)
}
y_train <- factor(y_train)
y_test <- factor(y_test, levels = levels(y_train))

k_fold <- 10
ctrl <- trainControl(method = "cv", number = k_fold, classProbs = FALSE, summaryFunction = defaultSummary)

# Train SGD model using caret with glmnet method (elastic net logistic regression, similar to SGD)
sgd_model <- train(
  x = X_train,
  y = y_train,
  method = "glmnet",
  trControl = ctrl,
  metric = "Accuracy"
)

# Predict on test data
sgd_pred <- predict(sgd_model, X_test)
# Confusion Matrix
cm_sgd <- confusionMatrix(sgd_pred, y_test, positive = levels(y_test)[2])
saveRDS(cm_sgd, "cm_sgd.rds")
print(cm_sgd$table)
##           Reference
## Prediction     0     1
##          0 36438 10191
##          1     1     0
# Accuracy
accuracy <- cm_sgd$overall["Accuracy"]
cat("Accuracy of model:", round(accuracy * 100, 2), "%\n")
## Accuracy of model: 78.14 %
# F1 Score
f1 <- F1_Score(y_pred = sgd_pred, y_true = y_test, positive = levels(y_test)[2])
cat("F1 Score:", round(f1, 3), "\n")
## F1 Score: NaN
# Recall (Sensitivity)
recall <- cm_sgd$byClass["Sensitivity"]
cat("Recall Score:", round(recall, 3), "\n")
## Recall Score: 0
# Balanced Accuracy
cat("Balanced Accuracy Score:", cm_sgd$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score: 0.4999863
# Cross-validated accuracy 
cv_accuracy <- max(sgd_model$results$Accuracy)
cat("Cross-validated accuracy:", round(cv_accuracy * 100, 2), "%\n")
## Cross-validated accuracy: 78.33 %

Decision Tree Classifier

set.seed(101)

# Preprocess target variables
if (is.list(y_train) || is.data.frame(y_train)) {
  y_train <- unlist(y_train)
}
y_train <- factor(y_train)
y_test <- factor(y_test, levels = levels(y_train))

k_fold <- 10
ctrl <- trainControl(method = "cv", number = k_fold)

# Train Decision Tree model with parameters similar to sklearn
dtree_model <- train(
  x = X_train,
  y = y_train,
  method = "rpart",
  tuneGrid = data.frame(cp = 0),
  trControl = ctrl,
  control = rpart.control(maxdepth = 10, minbucket = 30, maxcompete = 0, maxsurrogate = 0)
)

# Predict on test set
dtree_pred <- predict(dtree_model, X_test)

# Confusion matrix
cm_dt <- confusionMatrix(dtree_pred, y_test, positive = levels(y_test)[2])
saveRDS(cm_dt, "cm_dt.rds")
print(cm_dt$table)
##           Reference
## Prediction     0     1
##          0 36125  9997
##          1   314   194
# Accuracy
accuracy <- cm_dt$overall["Accuracy"]
cat("Accuracy of model:", round(accuracy * 100, 2), "%\n")
## Accuracy of model: 77.89 %
# F1 Score
f1 <- F1_Score(y_pred = dtree_pred, y_true = y_test, positive = levels(y_test)[2])
cat("F1 Score:", round(f1, 3), "\n")
## F1 Score: 0.036
# Recall (Sensitivity)
recall <- cm_dt$byClass["Sensitivity"]
cat("Recall Score:", round(recall, 3), "\n")
## Recall Score: 0.019
# Balanced Accuracy
cat("Balanced Accuracy Score:", cm_dt$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score: 0.5052096
# Cross-validated accuracy (already done during train)
cv_accuracy <- max(dtree_model$results$Accuracy)
cat("Cross-validated accuracy:", round(cv_accuracy * 100, 2), "%\n")
## Cross-validated accuracy: 77.97 %

XGBoost

train_with_progress_bar <- function(X_train, y_train, nrounds = 100) {
  # Convert data
  dtrain <- xgb.DMatrix(data = as.matrix(X_train), label = as.numeric(y_train) - 1)
  
  # Create progress bar
  pb <- progress_bar$new(
    format = "  Training [:bar] :percent eta: :eta",
    total = nrounds, 
    clear = FALSE, 
    width = 60
  )
  
  # Create a proper named callback function
  progress_callback <- function(env = parent.frame()) {
    pb$tick()
    return(TRUE)
  }
  
  # Assign a name attribute to the callback
  attr(progress_callback, "name") <- "progress_callback"
  
  # Train model with callback
  model <- xgb.train(
    params = list(
      objective = "binary:logistic",
      eval_metric = "logloss"
    ),
    data = dtrain,
    nrounds = nrounds,
    callbacks = list(progress_callback),
    verbose = 0  # Disable default verbosity
  )
  
  return(model)
}

# Call the function
xgb_model <- train_with_progress_bar(X_train, y_train, nrounds = 100)
# Predict on test set
dtest <- xgb.DMatrix(data = as.matrix(X_test), label = as.numeric(y_test) - 1)
xgb_pred <- predict(xgb_model, dtest)
predictions <- ifelse(xgb_pred > 0.5, 1, 0)
y_test_numeric <- as.numeric(y_test) -1 
# Confusion matrix
cm_xgb <- confusionMatrix(factor(predictions, levels = c(0,1)), factor(y_test_numeric, levels = c(0,1)))
saveRDS(cm_xgb, "cm_xgb.rds")
print(cm_xgb$table)
##           Reference
## Prediction     0     1
##          0 36212  9993
##          1   227   198
# Accuracy
accuracy <- cm_xgb$overall["Accuracy"]
cat("Accuracy of model:", round(accuracy * 100, 2), "%\n")
## Accuracy of model: 78.08 %
# F1 Score
f1 <- F1_Score(factor(predictions, levels = c(0,1)), factor(y_test_numeric, levels = c(0,1)))
cat("F1 Score:", round(f1, 3), "\n")
## F1 Score: 0.876
# Recall (Sensitivity)
recall <- cm_xgb$byClass["Sensitivity"]
cat("Recall Score:", round(recall, 3), "\n")
## Recall Score: 0.994
# Balanced Accuracy
cat("Balanced Accuracy Score:", cm_xgb$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score: 0.5065997

Dealing with Imbalanced Data

SMOTE

SMOTE or Synthetic Minority Oversampling Technique is used to create synthetic data. SMOTE uses a nearest neighbors algorithm to generate new and synthetic data we can use for training our model.

set.seed(27)       
train_indices <- createDataPartition(y$LOAN_DEFAULT_1, p = 0.75, list = FALSE) 
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices,]
y_test <- y[-train_indices,]

X_train$`EMPLOYMENT_TYPE_Self employed`[is.na(X_train$`EMPLOYMENT_TYPE_Self employed`)] <- 0
X_test$`EMPLOYMENT_TYPE_Self employed`[is.na(X_test$`EMPLOYMENT_TYPE_Self employed`)] <- 0
# Combine features and target for SMOTE
train_data <- cbind(X_train, target = y_train)

set.seed(27)       
smote_result <- SMOTE(X_train, y_train, K = 5, dup_size = 1)
X_train <- smote_result$data
y_train <- X_train$class
X_train$class <- NULL

Stochastic Gradient Descent with Modified Huber Loss

#' SGD with Modified Huber Loss Implementation
#'
#' @param X Matrix or data frame of features
#' @param y Vector of binary labels (0 or 1)
#' @param learning_rate Initial learning rate
#' @param epochs Number of epochs (passes through the data)
#' @param batch_size Size of mini-batches
#' @param alpha L2 regularization parameter
#' @param shuffle Whether to shuffle the data at each epoch
#' @param random_seed Random seed for reproducibility
#' @param verbose Whether to print progress
#'
#' @return List containing weights, bias, and training history
sgd_modified_huber <- function(X, y, learning_rate = 0.01, epochs = 100, 
                               batch_size = 32, alpha = 0.0001, 
                               shuffle = TRUE, random_seed = NULL, 
                               verbose = TRUE) {
  # Set random seed if provided
  if (!is.null(random_seed)) {
    set.seed(random_seed)
  }
  
  # Ensure X is a matrix
  if (is.data.frame(X)) {
    X <- as.matrix(X)
  }
  
  # Initialize parameters
  n_samples <- nrow(X)
  n_features <- ncol(X)
  w <- rep(0, n_features)  
  b <- 0                   
  
  # Initialize training history
  history <- data.frame(epoch = integer(), loss = numeric())
  
  # Modified Huber loss gradient function
  modified_huber_gradient <- function(y_true, y_pred) {
    # Calculate margin = y * (w*x + b)
    margin <- y_true * y_pred
    
    # Calculate gradients based on the margin
    gradients <- rep(0, length(margin))
    
    case1 <- margin <= -1
    case2 <- margin > -1 & margin < 1
    
    gradients[case1] <- -y_true[case1]
    gradients[case2] <- -y_true[case2] * (1 - margin[case2])
    
    return(gradients)
  }
  
  # Shuffle function
  shuffle_data <- function(X, y) {
    idx <- sample(n_samples)
    return(list(X = X[idx,], y = y[idx]))
  }
  
  # Training loop
  for (epoch in 1:epochs) {
    epoch_loss <- 0
    
    # Shuffle data if requested
    if (shuffle) {
      shuffled <- shuffle_data(X, y)
      X_epoch <- shuffled$X
      y_epoch <- shuffled$y
    } else {
      X_epoch <- X
      y_epoch <- y
    }
    
    # Process mini-batches
    for (i in seq(1, n_samples, by = batch_size)) {
      # Get mini-batch
      end_idx <- min(i + batch_size - 1, n_samples)
      X_batch <- X_epoch[i:end_idx, , drop = FALSE]
      y_batch <- y_epoch[i:end_idx]
      
      # Convert y to {-1, 1} for modified Huber loss
      y_binary <- 2 * y_batch - 1
      
      # Forward pass
      scores <- X_batch %*% w + b
      
      # Compute gradients
      grads <- modified_huber_gradient(y_binary, scores)
      
      # Update parameters
      grad_w <- t(X_batch) %*% grads / length(y_batch) + alpha * w
      grad_b <- mean(grads)
      
      w <- w - learning_rate * grad_w
      b <- b - learning_rate * grad_b
      
      # Compute loss for monitoring
      loss_batch <- calculate_modified_huber_loss(y_binary, scores)
      epoch_loss <- epoch_loss + loss_batch * length(y_batch)
    }
    
    # Average loss for the epoch
    epoch_loss <- epoch_loss / n_samples
    
    # Save history
    history <- rbind(history, data.frame(epoch = epoch, loss = epoch_loss))
    
    # Print progress
    if (verbose && epoch %% 10 == 0) {
      cat(sprintf("Epoch %d/%d - Loss: %.4f\n", epoch, epochs, epoch_loss))
    }
  }
  
  # Return model parameters and history
  return(list(
    weights = w,
    bias = b,
    history = history
  ))
}

# Helper function to calculate modified Huber loss
calculate_modified_huber_loss <- function(y_true, scores) {
  margin <- y_true * scores
  
  # Calculate loss based on the margin
  losses <- rep(0, length(margin))
  
  case1 <- margin <= -1
  case2 <- margin > -1 & margin < 1
  
  losses[case1] <- -4 * margin[case1]
  losses[case2] <- (1 - margin[case2])^2
  
  return(mean(losses))
}

# Prediction function
predict_sgd <- function(X, model, threshold = 0.5) {
  if (is.data.frame(X)) {
    X <- as.matrix(X)
  }
  
  # Calculate raw scores
  scores <- X %*% model$weights + model$bias
  
  # Convert scores to probabilities (simplified sigmoid)
  probs <- 1 / (1 + exp(-scores))
  
  # Return class predictions
  return(ifelse(probs > threshold, 1, 0))
}
y_train <- as.numeric(y_train)
# Fit the model
model <- sgd_modified_huber(
   X = X_train,
   y = y_train,
   learning_rate = 0.01,
   epochs = 100,
   batch_size = 32,
   alpha = 0.0001,
   shuffle = TRUE,
   random_seed = 101,
   verbose = TRUE
 )
## Epoch 10/100 - Loss: 0.8784
## Epoch 20/100 - Loss: 0.8784
## Epoch 30/100 - Loss: 0.8782
## Epoch 40/100 - Loss: 0.8783
## Epoch 50/100 - Loss: 0.8785
## Epoch 60/100 - Loss: 0.8783
## Epoch 70/100 - Loss: 0.8783
## Epoch 80/100 - Loss: 0.8782
## Epoch 90/100 - Loss: 0.8786
## Epoch 100/100 - Loss: 0.8787
# # Make predictions
 y_pred <- predict_sgd(X_test, model)
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)
cm_sgd_smote <- confusionMatrix(factor(y_pred, levels = c(0,1)), y_test_factor)
print(cm_sgd_smote$table)
##           Reference
## Prediction     0     1
##          0 43371 11101
##          1  2431  1385
print(paste0(round(cm_sgd_smote$overall["Accuracy"], 2) * 100, "%"))
## [1] "77%"
# Accuracy
cat("Accuracy of model", cm_sgd_smote$overall["Accuracy"], "\n")
## Accuracy of model 0.7678424
# F1 Score
cat("F1 Score", cm_sgd_smote$byClass["F1"], "\n")
## F1 Score 0.8650498
# Recall Score
cat("Recall Score", cm_sgd_smote$byClass["Sensitivity"], "\n")
## Recall Score 0.9469237
# Balanced Accuracy Score 
cat("Balanced Accuracy Score", cm_sgd_smote$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.528924

Decision Tree Classifier

set.seed(101)  # Equivalent to random_state in Python

# Train the decision tree
dtree <- rpart(
  formula = y_train ~ .,        
  data = cbind(X_train, y_train = y_train), 
  method = "class",            
  control = rpart.control(
    maxdepth = 10,              
    minsplit = 60,              
    minbucket = 30,             
    cp = 0.001                  
  )
)
# Predict on test set
dtree_pred <- predict(dtree, X_test, type = "class")

# Ensure predictions and actual values are factors with the same levels
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)
dtree_pred_factor <- factor(dtree_pred, levels = levels(y_test_factor))

# Calculate confusion matrix
cm_smote_dt <- confusionMatrix(dtree_pred_factor, y_test_factor)
saveRDS(cm_smote_dt, "cm_smote_dt.rds")
print(cm_smote_dt$table)
##           Reference
## Prediction     0     1
##          0 43661 11408
##          1  2141  1078
# Calculate accuracy
accuracy <- cm_smote_dt$overall["Accuracy"]
print(paste0(round(accuracy * 100, 2), "%"))
## [1] "76.76%"
# Accuracy
cat("Accuracy of model", cm_smote_dt$overall["Accuracy"], "\n")
## Accuracy of model 0.7675508
# F1 Score
cat("F1 Score", cm_smote_dt$byClass["F1"], "\n")
## F1 Score 0.8656799
# Recall Score
cat("Recall Score", cm_smote_dt$byClass["Sensitivity"], "\n")
## Recall Score 0.9532553
# Balanced Accuracy Score 
cat("Balanced Accuracy Score", cm_smote_dt$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.519796

Random Forest Classifier

set.seed(42)  

if(!is.factor(y_train)) {
  y_train_factor <- factor(y_train)
} else {
  y_train_factor <- y_train
}

# Train the Random Forest model
rfc <- randomForest(
  x = X_train,             
  y = y_train_factor,       
  ntree = 10,              
  importance = TRUE        
)

# Predict on test set
rfc_pred <- predict(rfc, X_test)
rfc_pred_factor <- factor(rfc_pred, levels = levels(y_test_factor))
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)

# Calculate confusion matrix
conf_matrix <- confusionMatrix(rfc_pred_factor, y_test_factor)
print(conf_matrix$table)
##           Reference
## Prediction     0     1
##          0 42497 10887
##          1  3305  1599
# Calculate accuracy and print as percentage
accuracy <- conf_matrix$overall["Accuracy"]
print(paste0("Accuracy: ", round(accuracy * 100, 2), "%"))
## [1] "Accuracy: 75.65%"
# Extract and print the metrics 
# Accuracy
cat("Accuracy of model", conf_matrix$overall["Accuracy"], "\n")
## Accuracy of model 0.7565194
# F1 Score
cat("F1 Score", conf_matrix$byClass["F1"], "\n")
## F1 Score 0.8569153
# Recall Score
cat("Recall Score", conf_matrix$byClass["Sensitivity"], "\n")
## Recall Score 0.9278416
# Balanced Accuracy Score 
cat("Balanced Accuracy Score", conf_matrix$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5279525

With SMOTE procedure applied, the best results, considering Accuracy, F1 Score and Recall, were obtained by Decision Tree Classifier model.

Upsampling

Upsampling can be defined as adding more copies of the minority class. Upsampling can be a good choice when you don’t have a ton of data to work with. (Not a good choice here though)

y <- train_dummy["LOAN_DEFAULT_1"]

X <- train_dummy[, setdiff(names(train_dummy), "LOAN_DEFAULT_1")]

set.seed(27)       
train_indices <- createDataPartition(y$LOAN_DEFAULT_1, p = 0.75, list = FALSE) 
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices,]
y_test <- y[-train_indices,]

X_train$`EMPLOYMENT_TYPE_Self employed`[is.na(X_train$`EMPLOYMENT_TYPE_Self employed`)] <- 0
X_test$`EMPLOYMENT_TYPE_Self employed`[is.na(X_test$`EMPLOYMENT_TYPE_Self employed`)] <- 0
X <- cbind(X_train, LOAN_DEFAULT = y_train)
X <- as.data.frame(X)

not_fraud <- X %>% filter(LOAN_DEFAULT_1 == 0)
fraud <- X %>% filter(LOAN_DEFAULT_1 == 1)

# Print class distribution
cat("Distribution before resampling:\n")
## Distribution before resampling:
cat("Non-fraud samples:", nrow(not_fraud), "\n")
## Non-fraud samples: 136741
cat("Fraud samples:", nrow(fraud), "\n")
## Fraud samples: 38125
# Load required libraries
set.seed(27)

# Upsample the minority class (fraud) with replacement
fraud_upsampled <- fraud %>%
  slice_sample(n = nrow(not_fraud), replace = TRUE)

# Combine majority class and upsampled minority class
upsampled <- bind_rows(not_fraud, fraud_upsampled)

# Check new class counts
upsampled_counts <- upsampled %>%
  count(LOAN_DEFAULT_1)
print(upsampled_counts)
##   LOAN_DEFAULT_1      n
## 1              0 136741
## 2              1 136741
# Separate features and target
y_train <- upsampled$LOAN_DEFAULT_1
X_train <- upsampled %>% select(-LOAN_DEFAULT_1)

Decision Tree Classifier

set.seed(101)  

# Train the decision tree
dtree <- rpart(
  formula = y_train ~ .,       
  data = cbind(X_train, y_train = y_train), 
  method = "class",            
  control = rpart.control(
    maxdepth = 10,             
    minsplit = 60,             
    minbucket = 30,             
    cp = 0.001                  
  )
)

# Predict on test set
dtree_pred <- predict(dtree, X_test, type = "class")

# Ensure predictions and actual values are factors with the same levels
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)
dtree_pred_factor <- factor(dtree_pred, levels = levels(y_test_factor))

# Calculate confusion matrix
conf_matrix <- confusionMatrix(dtree_pred_factor, y_test_factor)
print(conf_matrix$table)
##           Reference
## Prediction     0     1
##          0 25113  4663
##          1 20689  7823
# Calculate accuracy
accuracy <- conf_matrix$overall["Accuracy"]
print(paste0(round(accuracy * 100, 2), "%"))
## [1] "56.51%"
# Extract and print the metrics (equivalent to the Python code)
# Accuracy
cat("Accuracy of model", conf_matrix$overall["Accuracy"], "\n")
## Accuracy of model 0.5650563
# F1 Score
cat("F1 Score", conf_matrix$byClass["F1"], "\n")
## F1 Score 0.6645585
# Recall Score
cat("Recall Score", conf_matrix$byClass["Sensitivity"], "\n")
## Recall Score 0.5482948
# Balanced Accuracy Score 
cat("Balanced Accuracy Score", conf_matrix$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5874183

Stochastic Gradient Descent with Modified Huber Loss

# Fit the model
model <- sgd_modified_huber(
   X = X_train,
   y = y_train,
   learning_rate = 0.01,
   epochs = 100,
   batch_size = 32,
   alpha = 0.0001,
   shuffle = TRUE,
   random_seed = 101,
   verbose = TRUE
 )
## Epoch 10/100 - Loss: 0.9528
## Epoch 20/100 - Loss: 0.9527
## Epoch 30/100 - Loss: 0.9529
## Epoch 40/100 - Loss: 0.9528
## Epoch 50/100 - Loss: 0.9528
## Epoch 60/100 - Loss: 0.9526
## Epoch 70/100 - Loss: 0.9529
## Epoch 80/100 - Loss: 0.9527
## Epoch 90/100 - Loss: 0.9526
## Epoch 100/100 - Loss: 0.9527
#
# Make predictions
 y_pred <- predict_sgd(X_test, model)
 conf_matrix <- confusionMatrix(factor(y_pred, levels = c(0,1)), y_test_factor)
print(conf_matrix$table)
##           Reference
## Prediction     0     1
##          0 24190  4317
##          1 21612  8169
print(paste0(round(conf_matrix$overall["Accuracy"], 2) * 100, "%"))
## [1] "56%"
# Extract and print the metrics (equivalent to the Python code)
# Accuracy
cat("Accuracy of model", conf_matrix$overall["Accuracy"], "\n")
## Accuracy of model 0.5551572
# F1 Score
cat("F1 Score", conf_matrix$byClass["F1"], "\n")
## F1 Score 0.6510651
# Recall Score
cat("Recall Score", conf_matrix$byClass["Sensitivity"], "\n")
## Recall Score 0.5281429
# Balanced Accuracy Score 
cat("Balanced Accuracy Score", conf_matrix$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5911978

Random Forest Classifier

set.seed(42)

if(!is.factor(y_train)) {
  y_train_factor <- factor(y_train)
} else {
  y_train_factor <- y_train
}

# Train the Random Forest model
rfc <- randomForest(
  x = X_train,            
  y = y_train_factor,       
  ntree = 10,               
  importance = TRUE         
)

# Predict on test set
rfc_pred <- predict(rfc, X_test)

rfc_pred_factor <- factor(rfc_pred, levels = levels(y_test_factor))
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)

# Calculate confusion matrix
cm_upsampling_rf <- confusionMatrix(rfc_pred_factor, y_test_factor)
saveRDS(cm_upsampling_rf, "cm_upsampling_rf.rds")
print(cm_upsampling_rf$table)
##           Reference
## Prediction     0     1
##          0 31983  6946
##          1 13819  5540
# Calculate accuracy and print as percentage
accuracy <- cm_upsampling_rf$overall["Accuracy"]
print(paste0("Accuracy: ", round(accuracy * 100, 2), "%"))
## [1] "Accuracy: 64.38%"
# Extract and print the metrics
# Accuracy
cat("Accuracy of model", cm_upsampling_rf$overall["Accuracy"], "\n")
## Accuracy of model 0.6437517
# F1 Score
cat("F1 Score", cm_upsampling_rf$byClass["F1"], "\n")
## F1 Score 0.7549303
# Recall Score
cat("Recall Score", cm_upsampling_rf$byClass["Sensitivity"], "\n")
## Recall Score 0.6982883
# Balanced Accuracy Score 
cat("Balanced Accuracy Score", cm_upsampling_rf$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5709926

With upsampling procedure applied, the best results, considering Accuracy, F1 Score and Recall, were obtained by Random Forest Classifier model.

Downsampling

Undersampling can be defined as removing some observations of the majority class. Undersampling can be a good choice when you have a ton of data -think millions of rows. But a drawback is that we are removing information that may be valuable. This could lead to underfitting and poor generalization to the test set.

y <- train_dummy["LOAN_DEFAULT_1"]

X <- train_dummy[, setdiff(names(train_dummy), "LOAN_DEFAULT_1")]

set.seed(27)      
train_indices <- createDataPartition(y$LOAN_DEFAULT_1, p = 0.75, list = FALSE)  
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices,]
y_test <- y[-train_indices,]

X_train$`EMPLOYMENT_TYPE_Self employed`[is.na(X_train$`EMPLOYMENT_TYPE_Self employed`)] <- 0
X_test$`EMPLOYMENT_TYPE_Self employed`[is.na(X_test$`EMPLOYMENT_TYPE_Self employed`)] <- 0

X <- cbind(X_train, LOAN_DEFAULT = y_train)
X <- as.data.frame(X)

not_fraud <- X %>% filter(LOAN_DEFAULT_1 == 0)
fraud <- X %>% filter(LOAN_DEFAULT_1 == 1)
set.seed(27)

# Upsample the minority class (fraud) with replacement
fraud_downsampled <- fraud %>%
  slice_sample(n = nrow(not_fraud), replace = FALSE)

# Combine majority class and upsampled minority class
downsampled <- bind_rows(not_fraud, fraud_downsampled)

# Check new class counts
downsampled_counts <- downsampled %>%
  count(LOAN_DEFAULT_1)
print(downsampled_counts)
##   LOAN_DEFAULT_1      n
## 1              0 136741
## 2              1  38125
y_train <- downsampled$LOAN_DEFAULT_1
X_train <- downsampled %>% select(-LOAN_DEFAULT_1)

Decision Tree Classifier

set.seed(101) 

# Train the decision tree
dtree <- rpart(
  formula = y_train ~ .,        
  data = cbind(X_train, y_train = y_train),  
  method = "class",             
  control = rpart.control(
    maxdepth = 10,              
    minsplit = 60,              
    minbucket = 30,             
    cp = 0.001                  
  )
)

# Predict on test set
dtree_pred <- predict(dtree, X_test, type = "class")

y_test_factor <- factor(y_test$LOAN_DEFAULT_1)
dtree_pred_factor <- factor(dtree_pred, levels = levels(y_test_factor))

# Calculate confusion matrix
conf_matrix <- confusionMatrix(dtree_pred_factor, y_test_factor)
print(conf_matrix$table)
##           Reference
## Prediction     0     1
##          0 45802 12486
##          1     0     0
# Calculate accuracy
accuracy <- conf_matrix$overall["Accuracy"]
print(paste0(round(accuracy * 100, 2), "%"))
## [1] "78.58%"
# Extract and print the metrics 
# Accuracy
cat("Accuracy of model", conf_matrix$overall["Accuracy"], "\n")
## Accuracy of model 0.7857878
# F1 Score
cat("F1 Score", conf_matrix$byClass["F1"], "\n")
## F1 Score 0.8800461
# Recall Score
cat("Recall Score", conf_matrix$byClass["Sensitivity"], "\n")
## Recall Score 1
# Balanced Accuracy Score 
cat("Balanced Accuracy Score", conf_matrix$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5

Stochastic Gradient Descent with Modified Huber Loss

model <- sgd_modified_huber(
   X = X_train,
   y = y_train,
   learning_rate = 0.01,
   epochs = 100,
   batch_size = 32,
   alpha = 0.0001,
   shuffle = TRUE,
   random_seed = 101,
   verbose = TRUE
 )
## Epoch 10/100 - Loss: 0.6618
## Epoch 20/100 - Loss: 0.6617
## Epoch 30/100 - Loss: 0.6620
## Epoch 40/100 - Loss: 0.6615
## Epoch 50/100 - Loss: 0.6621
## Epoch 60/100 - Loss: 0.6616
## Epoch 70/100 - Loss: 0.6619
## Epoch 80/100 - Loss: 0.6619
## Epoch 90/100 - Loss: 0.6616
## Epoch 100/100 - Loss: 0.6619
 y_pred <- predict_sgd(X_test, model)
 conf_matrix <- confusionMatrix(factor(y_pred, levels = c(0,1)), y_test_factor)
print(conf_matrix$table)
##           Reference
## Prediction     0     1
##          0 45710 12415
##          1    92    71
print(paste0(round(conf_matrix$overall["Accuracy"], 2) * 100, "%"))
## [1] "79%"
# Extract and print the metrics 
# Accuracy
cat("Accuracy of model", conf_matrix$overall["Accuracy"], "\n")
## Accuracy of model 0.7854275
# F1 Score
cat("F1 Score", conf_matrix$byClass["F1"], "\n")
## F1 Score 0.8796559
# Recall Score
cat("Recall Score", conf_matrix$byClass["Sensitivity"], "\n")
## Recall Score 0.9979914
# Balanced Accuracy Score 
cat("Balanced Accuracy Score", conf_matrix$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5018389

Random Forest Classifier

set.seed(42) 

if(!is.factor(y_train)) {
  y_train_factor <- factor(y_train)
} else {
  y_train_factor <- y_train
}

# Train the Random Forest model
rfc <- randomForest(
  x = X_train,              
  y = y_train_factor,       
  ntree = 10,               
  importance = TRUE        
)

# Predict on test set
rfc_pred <- predict(rfc, X_test)

rfc_pred_factor <- factor(rfc_pred, levels = levels(y_test_factor))
y_test_factor <- factor(y_test$LOAN_DEFAULT_1)

# Calculate confusion matrix
cm_downsampling_rf <- confusionMatrix(rfc_pred_factor, y_test_factor)
saveRDS(cm_downsampling_rf, "cm_downsampling_rf.rds")
print(cm_downsampling_rf$table)
##           Reference
## Prediction     0     1
##          0 44965 12068
##          1   837   418
accuracy <- cm_downsampling_rf$overall["Accuracy"]
print(paste0("Accuracy: ", round(accuracy * 100, 2), "%"))
## [1] "Accuracy: 77.86%"
# Extract and print the metrics 
# Accuracy
cat("Accuracy of model", cm_downsampling_rf$overall["Accuracy"], "\n")
## Accuracy of model 0.7785994
# F1 Score
cat("F1 Score", cm_downsampling_rf$byClass["F1"], "\n")
## F1 Score 0.8745077
# Recall Score
cat("Recall Score", cm_downsampling_rf$byClass["Sensitivity"], "\n")
## Recall Score 0.9817257
# Balanced Accuracy Score 
cat("Balanced Accuracy Score", cm_downsampling_rf$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5076016

With downsamling procedure applied, based on the lowest number of false negative the Random Forest Classifier obtained the best results.

PCA

Principal Component Analysis (PCA) is a dimensionality reduction technique used in statistics and machine learning to reduce the number of features (dimensions) in a dataset while preserving as much of its variance (information) as possible.

y <- train_dummy["LOAN_DEFAULT_1"]

X <- train_dummy[, setdiff(names(train_dummy), "LOAN_DEFAULT_1")]
X$`EMPLOYMENT_TYPE_Self employed`[is.na(X$`EMPLOYMENT_TYPE_Self employed`)] <- 0

set.seed(27)      
train_indices <- createDataPartition(y$LOAN_DEFAULT_1, p = 0.75, list = FALSE)  
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices,]
y_test <- y[-train_indices,]

X_train$`EMPLOYMENT_TYPE_Self employed`[is.na(X_train$`EMPLOYMENT_TYPE_Self employed`)] <- 0
X_test$`EMPLOYMENT_TYPE_Self employed`[is.na(X_test$`EMPLOYMENT_TYPE_Self employed`)] <- 0
pca_result <- prcomp(X_train, 
                     center = TRUE,  
                     scale. = FALSE)

variance_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
plot(variance_explained, 
     xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained",
     type = "b",
     main = "Scree Plot")

# Cumulative variance explained
cumulative_variance <- cumsum(variance_explained)
plot(cumulative_variance, 
     xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b",
     main = "Cumulative Variance Explained")

We can see that 17 components are enough to explain most of the variance. Let’s see the PCA with 17 components then.

pca_result <- prcomp(X_train, 
                     center = TRUE,  
                     scale. = FALSE,
                     rank. = 17)  
X_train_PCA <- as.data.frame(pca_result$x)

Decision Tree Classifier

set.seed(101)  

y_train <- y_train$LOAN_DEFAULT_1
# Train the decision tree
dtree <- rpart(
  formula = y_train ~ .,        
  data = cbind(X_train_PCA, y_train = y_train), 
  method = "class",             
  control = rpart.control(
    maxdepth = 10,              
    minsplit = 60,              
    minbucket = 30,             
    cp = 0.001                  
  )
)

# Predict on test set
X_test_PCA <- predict(pca_result, newdata = X_test)
X_test_PCA <- as.data.frame(X_test_PCA)

dtree_pred <- predict(dtree, X_test_PCA, type = "class")

y_test_factor <- factor(y_test$LOAN_DEFAULT_1)
dtree_pred_factor <- factor(dtree_pred, levels = levels(y_test_factor))

# Calculate confusion matrix
cm_pca <- confusionMatrix(dtree_pred_factor, y_test_factor)
print(cm_pca$table)
##           Reference
## Prediction     0     1
##          0 45802 12486
##          1     0     0
# Calculate accuracy
accuracy <- cm_pca$overall["Accuracy"]
print(paste0(round(accuracy * 100, 2), "%"))
## [1] "78.58%"
# Accuracy
cat("Accuracy of model", cm_pca$overall["Accuracy"], "\n")
## Accuracy of model 0.7857878
# F1 Score
cat("F1 Score", cm_pca$byClass["F1"], "\n")
## F1 Score 0.8800461
# Recall Score
cat("Recall Score", cm_pca$byClass["Sensitivity"], "\n")
## Recall Score 1
# Balanced Accuracy Score 
cat("Balanced Accuracy Score", cm_pca$byClass["Balanced Accuracy"], "\n")
## Balanced Accuracy Score 0.5

Conclusions

extract_caret_metrics <- function(conf_matrix_obj) {
  # Extract the standard metrics provided by caret
  accuracy <- conf_matrix_obj$overall["Accuracy"]
  
  # Extract class-specific statistics
  class_stats <- conf_matrix_obj$byClass
  
  # For binary classification, we're interested in the positive class metrics
  precision <- class_stats["Pos Pred Value"]  # Precision = Positive Predictive Value
  recall <- class_stats["Sensitivity"]        # Recall = Sensitivity
  specificity <- class_stats["Specificity"]
  f1_score <- class_stats["F1"]
  
  return(data.frame(
    Accuracy = accuracy,
    Precision = precision,
    Recall = recall,
    F1_Score = f1_score,
    Specificity = specificity
  ))
}
# Function to read RDS files containing caret confusionMatrix objects
read_caret_model_results <- function(file_path, model_name) {
  # Read the RDS file
  result <- readRDS(file_path)
  
  # Check if it's a caret confusionMatrix object
  if (class(result)[1] == "confusionMatrix") {
    conf_matrix_obj <- result
  } else {
    stop(paste("File does not contain a caret confusionMatrix object:", file_path))
  }
  
  # Extract the actual confusion matrix table
  conf_matrix <- conf_matrix_obj$table
  
  # Extract metrics
  metrics <- extract_caret_metrics(conf_matrix_obj)
  
  return(list(
    model_name = model_name,
    confusion_matrix = conf_matrix,
    conf_matrix_obj = conf_matrix_obj,  # Keep the full object for reference
    metrics = metrics
  ))
}
# Function to plot confusion matrix
plot_confusion_matrix <- function(conf_matrix, title) {
  # Convert matrix to data frame for plotting
  conf_df <- as.data.frame(as.table(conf_matrix))
  names(conf_df) <- c("Actual", "Predicted", "Frequency")
  
  # Calculate percentages for text labels
  total <- sum(conf_df$Frequency)
  conf_df$Percentage <- conf_df$Frequency / total * 100
  
  # Calculate row totals for row-wise percentages
  row_totals <- aggregate(Frequency ~ Actual, conf_df, sum)
  conf_df <- merge(conf_df, row_totals, by = "Actual", suffixes = c("", "_total"))
  conf_df$Row_Percentage <- conf_df$Frequency / conf_df$Frequency_total * 100
  
  # Create plot
  ggplot(conf_df, aes(x = Predicted, y = Actual, fill = Frequency)) +
    geom_tile() +
    geom_text(aes(label = sprintf("%.0f\n(%.1f%%)", Frequency, Row_Percentage)), 
              color = "white", size = 4) +
    scale_fill_gradient(low = "steelblue", high = "darkblue") +
    labs(
      title = title,
      x = "Predicted Class",
      y = "Actual Class"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold", hjust = 0.5),
      axis.text = element_text(size = 12),
      legend.position = "none"
    )
}
# Function to plot comparison of metrics across models
plot_metrics_comparison <- function(all_metrics) {
  # Convert to long format for plotting
  metrics_long <- all_metrics %>%
    pivot_longer(cols = c("Accuracy", "Precision", "Recall", "F1_Score", "Specificity"),
                 names_to = "Metric", values_to = "Value")
  
  # Create the plot
  ggplot(metrics_long, aes(x = Model, y = Value, fill = Model)) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = sprintf("%.3f", Value)), position = position_stack(vjust = 0.5),
              color = "white") +
    facet_wrap(~ Metric, scales = "free_y") +
    scale_y_continuous(labels = percent_format(accuracy = 1)) +
    labs(
      title = "Model Comparison by Performance Metrics",
      x = "Model",
      y = "Score"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold", hjust = 0.5, size = 16),
      axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
      legend.position = "none",
      strip.text = element_text(face = "bold", size = 12)
    )
}
# Main function to compare models
compare_caret_models <- function(file_paths, model_names) {
  # Check that the lists have the same length
  if (length(file_paths) != length(model_names)) {
    stop("The number of file paths and model names must be the same")
  }
  
  # Read all model results
  model_results <- list()
  for (i in 1:length(file_paths)) {
    tryCatch({
      model_results[[i]] <- read_caret_model_results(file_paths[i], model_names[i])
      cat("Successfully loaded: ", model_names[i], "\n")
    }, error = function(e) {
      cat("Error loading ", model_names[i], ": ", e$message, "\n")
    })
  }
  
  # Extract metrics for all models
  all_metrics <- data.frame()
  for (result in model_results) {
    metrics_df <- result$metrics
    metrics_df$Model <- result$model_name
    all_metrics <- rbind(all_metrics, metrics_df)
  }
  
  # Create plots for confusion matrices
  conf_plots <- list()
  for (i in 1:length(model_results)) {
    conf_plots[[i]] <- plot_confusion_matrix(
      model_results[[i]]$confusion_matrix, 
      paste("Confusion Matrix -", model_results[[i]]$model_name)
    )
  }
  
  # Plot metrics comparison
  metrics_plot <- plot_metrics_comparison(all_metrics)
  
  # Print metrics table sorted by F1 Score
  cat("\n================= MODEL COMPARISON SUMMARY =================\n")
  print(all_metrics %>% 
          arrange(desc(F1_Score)) %>% 
          select(Model, Accuracy, Precision, Recall, F1_Score, Specificity) %>%
          mutate_if(is.numeric, round, 4))
  
  # Create a summary table for statistical comparison
  cat("\n================= DETAILED MODEL STATISTICS =================\n")
  for (i in 1:length(model_results)) {
    cat("\n", model_names[i], ":\n")
    print(model_results[[i]]$conf_matrix_obj$overall)
    cat("\nClass-specific statistics:\n")
    print(model_results[[i]]$conf_matrix_obj$byClass)
    cat("\n-------------------------------------------\n")
  }
  
  # Determine best model based on F1 Score
  best_model <- all_metrics %>%
    arrange(desc(F1_Score)) %>%
    dplyr::slice(1)
  
  cat("\n====================== BEST MODEL ======================\n")
  cat("Based on F1 Score: ", best_model$Model, "\n")
  cat("F1 Score: ", round(best_model$F1_Score, 4), "\n")
  cat("Accuracy: ", round(best_model$Accuracy, 4), "\n")
  cat("Recall (Sensitivity): ", round(best_model$Recall, 4), "\n")
  cat("Precision (Positive Predictive Value): ", round(best_model$Precision, 4), "\n")
  
  # Return results
  return(list(
    metrics = all_metrics,
    confusions_plots = conf_plots,
    metrics_plot = metrics_plot,
    best_model = best_model$Model,
    model_results = model_results  # Return full model results for additional analysis
  ))
}
# Function to process all RDS files in a directory
batch_process_caret_models <- function(directory_path) {
  # Get all RDS files in the directory
  rds_files <- list.files(directory_path, pattern = "\\.rds$", full.names = TRUE)
  
  if (length(rds_files) == 0) {
    cat("No RDS files found in directory:", directory_path, "\n")
    return(NULL)
  }
  
  # Extract model names from filenames (remove path and extension)
  model_names <- gsub("\\.rds$", "", basename(rds_files))
  
  # Compare models
  results <- compare_caret_models(rds_files, model_names)
  
  # Save results automatically
  output_dir <- file.path(directory_path, "model_comparison_results")
  if (!dir.exists(output_dir)) {
    dir.create(output_dir)
  }
  
  # Save metrics comparison plot
  metrics_filename <- file.path(output_dir, "metrics_comparison.png")
  ggsave(metrics_filename, results$metrics_plot, width = 10, height = 8)
  
  # Save confusion matrix plots
  for (i in 1:length(results$confusion_plots)) {
    conf_filename <- file.path(output_dir, paste0("confusion_matrix_", model_names[i], ".png"))
    ggsave(conf_filename, results$confusion_plots[[i]], width = 7, height = 6)
  }
  
  # Create a combined confusion matrix plot
  #combined_plot <- gridExtra::grid.arrange(grobs = results$confusion_plots, ncol = 2)
  #ggsave(file.path(output_dir, "combined_confusion_matrices.png"), combined_plot, width = 14, height = 10)
  
  # Save metrics table
  write.csv(results$metrics, file.path(output_dir, "metrics_summary.csv"), row.names = FALSE)
  
  cat("\nResults saved to directory:", output_dir, "\n")
  return(results)
}
path <- getwd()
results <- batch_process_caret_models(path)
## Successfully loaded:  cm_downsampling_rf 
## Successfully loaded:  cm_dt 
## Successfully loaded:  cm_log 
## Successfully loaded:  cm_nb 
## Successfully loaded:  cm_rf 
## Successfully loaded:  cm_sgd 
## Successfully loaded:  cm_smote_dt 
## Successfully loaded:  cm_upsampling_rf 
## Successfully loaded:  cm_xgb 
## 
## ================= MODEL COMPARISON SUMMARY =================
##                        Model Accuracy Precision Recall F1_Score Specificity
## Accuracy2             cm_log   0.7815    0.7821 0.9987   0.8772      0.0052
## Accuracy8             cm_xgb   0.7808    0.7837 0.9938   0.8763      0.0194
## Accuracy  cm_downsampling_rf   0.7786    0.7884 0.9817   0.8745      0.0335
## Accuracy4              cm_rf   0.7758    0.7849 0.9822   0.8726      0.0377
## Accuracy6        cm_smote_dt   0.7676    0.7928 0.9533   0.8657      0.0863
## Accuracy7   cm_upsampling_rf   0.6438    0.8216 0.6983   0.7549      0.4437
## Accuracy3              cm_nb   0.4242    0.8404 0.3249   0.4686      0.7794
## Accuracy1              cm_dt   0.7789    0.3819 0.0190   0.0363      0.9914
## Accuracy5             cm_sgd   0.7814    0.0000 0.0000      NaN      1.0000
## 
## ================= DETAILED MODEL STATISTICS =================
## 
##  cm_downsampling_rf :
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##     0.77859937     0.02259487     0.77520638     0.78196460     0.78578781 
## AccuracyPValue  McnemarPValue 
##     0.99998774     0.00000000 
## 
## Class-specific statistics:
##          Sensitivity          Specificity       Pos Pred Value 
##           0.98172569           0.03347749           0.78840321 
##       Neg Pred Value            Precision               Recall 
##           0.33306773           0.78840321           0.98172569 
##                   F1           Prevalence       Detection Rate 
##           0.87450771           0.78578781           0.77142808 
## Detection Prevalence    Balanced Accuracy 
##           0.97846898           0.50760159 
## 
## -------------------------------------------
## 
##  cm_dt :
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##     0.77887626     0.01583981     0.77508148     0.78263629     0.78144971 
## AccuracyPValue  McnemarPValue 
##     0.91140137     0.00000000 
## 
## Class-specific statistics:
##          Sensitivity          Specificity       Pos Pred Value 
##          0.019036405          0.991382859          0.381889764 
##       Neg Pred Value            Precision               Recall 
##          0.783248775          0.381889764          0.019036405 
##                   F1           Prevalence       Detection Rate 
##          0.036265072          0.218550290          0.004160412 
## Detection Prevalence    Balanced Accuracy 
##          0.010894274          0.505209632 
## 
## -------------------------------------------
## 
##  cm_log :
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##    0.781535492    0.005992706    0.777756880    0.785279023    0.781449710 
## AccuracyPValue  McnemarPValue 
##    0.484775934    0.000000000 
## 
## Class-specific statistics:
##          Sensitivity          Specificity       Pos Pred Value 
##          0.998655287          0.005200667          0.782109697 
##       Neg Pred Value            Precision               Recall 
##          0.519607843          0.782109697          0.998655287 
##                   F1           Prevalence       Detection Rate 
##          0.877216243          0.781449710          0.780398885 
## Detection Prevalence    Balanced Accuracy 
##          0.997812567          0.501927977 
## 
## -------------------------------------------
## 
##  cm_nb :
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##     0.42421188     0.05826022     0.41972021     0.42871299     0.78144971 
## AccuracyPValue  McnemarPValue 
##     1.00000000     0.00000000 
## 
## Class-specific statistics:
##          Sensitivity          Specificity       Pos Pred Value 
##            0.3248717            0.7794132            0.8404089 
##       Neg Pred Value            Precision               Recall 
##            0.2440696            0.8404089            0.3248717 
##                   F1           Prevalence       Detection Rate 
##            0.4685997            0.7814497            0.2538709 
## Detection Prevalence    Balanced Accuracy 
##            0.3020802            0.5521425 
## 
## -------------------------------------------
## 
##  cm_rf :
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##     0.77580956     0.02946235     0.77199642     0.77958834     0.78144971 
## AccuracyPValue  McnemarPValue 
##     0.99838371     0.00000000 
## 
## Class-specific statistics:
##          Sensitivity          Specificity       Pos Pred Value 
##           0.98224430           0.03768031           0.78492949 
##       Neg Pred Value            Precision               Recall 
##           0.37245393           0.78492949           0.98224430 
##                   F1           Prevalence       Detection Rate 
##           0.87257125           0.78144971           0.76757452 
## Detection Prevalence    Balanced Accuracy 
##           0.97788977           0.50996230 
## 
## -------------------------------------------
## 
##  cm_sgd :
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   7.814283e-01  -4.288847e-05   7.776490e-01   7.851725e-01   7.814497e-01 
## AccuracyPValue  McnemarPValue 
##   5.071246e-01   0.000000e+00 
## 
## Class-specific statistics:
##          Sensitivity          Specificity       Pos Pred Value 
##         0.000000e+00         9.999726e-01         0.000000e+00 
##       Neg Pred Value            Precision               Recall 
##         7.814450e-01         0.000000e+00         0.000000e+00 
##                   F1           Prevalence       Detection Rate 
##                  NaN         2.185503e-01         0.000000e+00 
## Detection Prevalence    Balanced Accuracy 
##         2.144542e-05         4.999863e-01 
## 
## -------------------------------------------
## 
##  cm_smote_dt :
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##     0.76755078     0.05423058     0.76409986     0.77097504     0.78578781 
## AccuracyPValue  McnemarPValue 
##     1.00000000     0.00000000 
## 
## Class-specific statistics:
##          Sensitivity          Specificity       Pos Pred Value 
##            0.9532553            0.0863367            0.7928417 
##       Neg Pred Value            Precision               Recall 
##            0.3348866            0.7928417            0.9532553 
##                   F1           Prevalence       Detection Rate 
##            0.8656799            0.7857878            0.7490564 
## Detection Prevalence    Balanced Accuracy 
##            0.9447742            0.5197960 
## 
## -------------------------------------------
## 
##  cm_upsampling_rf :
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.6437517      0.1183014      0.6398483      0.6476408      0.7857878 
## AccuracyPValue  McnemarPValue 
##      1.0000000      0.0000000 
## 
## Class-specific statistics:
##          Sensitivity          Specificity       Pos Pred Value 
##            0.6982883            0.4436969            0.8215726 
##       Neg Pred Value            Precision               Recall 
##            0.2861718            0.8215726            0.6982883 
##                   F1           Prevalence       Detection Rate 
##            0.7549303            0.7857878            0.5487064 
## Detection Prevalence    Balanced Accuracy 
##            0.6678733            0.5709926 
## 
## -------------------------------------------
## 
##  cm_xgb :
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##     0.78082779     0.02015605     0.77704486     0.78457574     0.78144971 
## AccuracyPValue  McnemarPValue 
##     0.62986861     0.00000000 
## 
## Class-specific statistics:
##          Sensitivity          Specificity       Pos Pred Value 
##           0.99377041           0.01942891           0.78372471 
##       Neg Pred Value            Precision               Recall 
##           0.46588235           0.78372471           0.99377041 
##                   F1           Prevalence       Detection Rate 
##           0.87633706           0.78144971           0.77658160 
## Detection Prevalence    Balanced Accuracy 
##           0.99088570           0.50659966 
## 
## -------------------------------------------
## 
## ====================== BEST MODEL ======================
## Based on F1 Score:  cm_log 
## F1 Score:  0.8772 
## Accuracy:  0.7815 
## Recall (Sensitivity):  0.9987 
## Precision (Positive Predictive Value):  0.7821
## 
## Results saved to directory: /Users/joannamisiak/Desktop/STUDIA/Reproducible_Research/RR_project/model_comparison_results

Based on our code the best model results were obtained by the Logistic Regression Model. Original study’s best model was Random Forest model with applied SMOTE. The results are much different, which can be explained by different specification of the implemented functions used for modelling.